Preface



Welcome to NLDB04, the Ninth International Conference on the Application of
Natural Language to Information Systems, held at the University of Salford, UK,
during June 23-25, 2004. NLDB04 follows on the success of previous conferences held
since 1995. Early conferences, then known as Application of Natural Language to
Databases (hence the acronym NLDB), were used as a forum to discuss and disseminate
research on the integration of natural language and databases and were mainly
concerned with natural language based queries, database modelling and user
interfaces that facilitate access to information. The conference has since moved to
encompass all aspects of Information Systems and Software Engineering. Indeed, the
use of natural language in systems modelling has greatly improved the development
process and benefited both developers and users at all stages of the software
development process.

The latest developments in the field of natural language and the emergence of new
technologies have seen a shift towards the storage of large semantic electronic
dictionaries, their exploitation and the advent of what is now known as the semantic
web. Information extraction and retrieval, document and content management, ontology
development and management and natural language conversational systems have become
regular tracks in recent NLDB conferences.

NLDB04 has seen a 50% increase in the number of submissions and has established
itself as one of the leading conferences in the area of applying natural language
to information systems in its broader sense. The quality of the submissions and their
diversity have made the members of the program committee work more than usual. In
total, 65 papers were submitted from 22 different countries. Of these, 29 were
accepted as regular papers, while 13 were accepted as short papers. The papers were
classified as belonging to one of these themes:

• Natural Language Conversational Systems 

• Intelligent Querying 

• Linguistic Aspects of Modeling 

• Information Retrieval 

• Natural Language Text Understanding 

• Knowledge Bases 

• Knowledge Management 

• Content Management 

This year we were honored by the presence of our invited speaker Fabio Ciravegna
from the University of Sheffield, United Kingdom. His lecture on “Challenges in
Harvesting Information for the Semantic Web” was highly appreciated and initiated
lively discussions.

We are very thankful for the opportunity to serve as Program Chair and Conference
Chair for this conference. However, the organization of such an event is a collective
effort and the result of teamwork. First of all, we would like to thank the members
of the Program Committee for the time and effort they devoted to the reviewing of the
submitted articles and to the selection process. Our thanks also go to the additional
reviewers for their help and support. We would like to take this opportunity to thank
the local organizing committee, especially its chairman Sunil Vadera, for their
superb work. We would like to thank Nigel Linge, the head of the School of Computing
Science and Engineering, Tim Ritchings, the head of the Computer Science, Multimedia
and Telecommunication discipline, and Mr. Gary Wright from the External Relations
Division for their help and support.

Naturally, we thank the authors for their high quality submissions, their
participation in this event and their patience during the long reviewing process.

June 2004 Farid Meziane 

Elisabeth Metais 
A Natural Language Model and a System for Managing TV-Anytime Information from Mobile Devices
Abstract. The TV-Anytime standard describes structures of categories of digital
TV program metadata, as well as User Profile metadata for TV programs. In
this case study we describe a natural language model and a system that allow users
to interact with the metadata and preview TV programs stored in remote databases
from their mobile devices, despite the limited configurations of those devices. By
using the TV-Anytime metadata specifications the system greatly limits the
possibility for ambiguities. The interaction model deals with the remaining
ambiguities by using the TV-Anytime user profiles and metadata information
concerning digital TV to rank the possible answers. The user interacts with the
system through a PDA or a mobile phone, with the metadata information stored in a
database on a remote TV-Anytime compatible TV set.



1 Introduction 

The number of digital TV channels has increased dramatically over the last few
years, and several industrial sectors and content producing sectors are active in
defining the environment in which the TVs of the future will operate.

The TV-Anytime Forum is an association of organizations which seeks to develop
specifications to enable audio-visual and other services based on mass-market high
volume digital storage in consumer platforms, simply referred to as local storage [1].
These specifications target interoperable and integrated systems, from content
creators/providers, through service providers, to the consumers, and aim to enable
applications to exploit the storage capabilities in consumer platforms. The basic
architectural unit is an expanded TV set (known as a Personal Digital Recorder, PDR)
capable of capturing digital satellite broadcasts according to the user's interests,
as they are described in his profile, and storing them on large storage devices. The
current TV-Anytime standard specifications define the structures for the metadata
that can be used to describe TV programs and broadcasts, as well as for the metadata
that can be used to describe the user profile. Expanded versions of the TV-Anytime
architecture also foresee last mile TV-Anytime servers, Internet connection of the TV
set and mobility aspects. Mobile devices (mobile phones, PDAs, etc.) in the
TV-Anytime architecture can be used by a user to communicate with the home TV set not
only for viewing TV programs, but also for managing the contents of the TV set (like
previewing its contents, searching for content, deleting content that has been
recorded for him by the TV set, etc.) and for managing his profile preferences [2].

There is a strong need for new interface paradigms that allow the interaction of
naive users with the future TV sets in order to better satisfy their dynamic
preferences and access information. The usual PC-based interfaces are not appropriate
for interacting with mobile devices (like mobile phones or PDAs) or with TV sets.
Natural language interfaces (NLIs) are a more appropriate interface style for naive
users, and they can also support voice-based interactions for mobile devices.

The appeal of natural language interfaces to databases has been explored since the
beginning of the ‘80s [6], [7]. Significant advances have been made in dialogue
management [3], [4], [5], but the problem of reliably understanding a single sentence
has not been solved. In comparison to the efforts made several years ago to enrich
databases with NLIs, which faced the prohibitive cost of dialogues to fully clarify
the query [3], our environment is more concrete than general purpose interfaces to
database systems, since the structure imposed by the TV-Anytime specifications for
the metadata greatly limits the possibilities for ambiguities.

The importance of natural language interfaces to databases has increased rapidly
over the last few years due to the introduction of new user devices (including mobile
devices such as PDAs and mobile phones) for which traditional mouse based interfaces
are unacceptable. Research has been published in the area of NLIs to interactive TV
based information systems [8], [9]. A well-known problem with NLIs is that user
interactions may be ambiguous. Ambiguity in NLIs is a serious problem, and most
systems proposed in the literature often lead to lengthy clarification dialogues with
the user to resolve ambiguities [14]. These dialogue systems face the problem that
the users often do not know the answers to questions asked by the system. Unlike the
previous systems, we do not resolve the remaining ambiguities with clarification.
Instead, we take advantage of the TV-Anytime user profile specifications in order
to rank the possible interpretations and present to the user at the top position the
one with the highest ranking.

In this paper we present a model for natural language interactions with a TV set in
an environment that follows the TV-Anytime specifications, both for the TV program
metadata and for the user profile metadata. The metadata are stored in databases
with last mile connections. The natural language interactions are used to preview
programs or summaries of programs as well as to completely manage the metadata
and the programs that the TV set keeps for the user. In addition, we describe an
implementation of this TV-Anytime compatible natural language interaction model that
works on a PDA and a mobile phone, which communicate with the TV-Anytime TV
set for managing its programs and metadata and also allow the previewing of TV
programs from the mobile device.

The best-known dialogue systems that have been developed for digital TV and
mobile environments are related to the MIINA project [11] and the Program Guide
Information System of NOKIA [12]. In the context of the MIINA project, a system has
been developed for information retrieval from the set-top-box Mediaterminal of
NOKIA. The user is allowed to insert queries for TV programs, channels, program
categories and broadcast time, using natural language. However, the natural language
interaction in this model is rather simple, since it is only related to the
information provided by a traditional TV-Guide. The Program Guide Information System
is an electronic call-in demo application offering information about television
programs over the phone by allowing the user to converse with the system in natural
language sentences. This system is not based on TV-Anytime metadata structures for
describing the programs or the user profiles. The scope of the interaction does not
include any management of the stored content, other than retrieval, or of the user
profiles. The main differences between those systems and the one described in this
paper are that the present system uses the TV-Anytime content and consumer metadata
specifications for a complete management of TV programs and user profiles, and that
the system uses additional information that exists in the TV-Anytime User Profile in
order to avoid lengthy clarification dialogues and help the user to get the most
relevant answers at the top of the result list.

In section 2 of this paper the natural language model for the digital TV environment
is presented, along with the functionality provided and the representation of the
information that the system collects from the user’s input. In section 3 we present
the algorithm that resolves the ambiguities without using clarification dialogues.
Section 4 analyses the system architecture and the modules that constitute it.
Section 5 presents the implementation environment of the system and of the
applications on the client side. Section 6 presents the results of the system’s
evaluation based on user experiments, and section 7 concludes by summarizing the
content of this paper.



2 The Natural Language Model for the Digital TV Environment 

The proposed Natural Language Model allows a user to determine the rules of
management of digital TV data (programs and metadata), retrieve TV program content
based on any information of its metadata description, express his preferences for the
types of TV programs that will be stored, manage his selection list (i.e. programs
that have been selected by the PDR or the user himself as candidates for recording),
create his profile and modify any of the above.

The user’s utterance consists of a combination of sub-phrases. The categories
of these sub-phrases are Introduction phrases, to define the functionality, Search
phrases, to define the TV-Anytime information, Target phrases, to define where each
of the functions is targeting, Temporal phrases, to define phrases about date and
time, and Summary phrases, to define summaries with audio/visual content.

The structure that represents the information gathered from the user’s utterance is
shown in figure 1. This structure consists of three parts, namely Element, Element
Type and Element Value. The first structure part (Element) is used to differentiate
the TV-Anytime metadata information (modeled as TVA-properties) from the information
that directs the system to the correct management of the user’s input (modeled as
flags). The TV-Anytime information about date and time is modeled as temporal
Elements. The second structure part (Element Type) is used in order to further
specialize the aforementioned information and to obtain its corresponding Element
Value (the third structure part) from the user’s utterance. When a user inserts an
utterance into the system, the system generates a feature structure [10] that follows
the structure of the model.
Element          Element Type              Element Value
---------------  ------------------------  -------------------------------------------
flags            action                    retrieve, insert, delete, profile
                 target                    box, list, profile

temporal         time                      1 ... 24
                 day                       Monday ... Sunday
                 month                     January ... December
                 year                      <YYYY>
                 before: time              1 ... 24
                 before: day               Monday ... Sunday
                 before: month             January ... December
                 before: year              <YYYY>
                 after: time               1 ... 24
                 after: day                Monday ... Sunday
                 after: month              January ... December
                 after: year               <YYYY>
                 time indicator            pm or am
                 day indicator             weekly

TVA-properties   genre                     <list of genres from the TVA Specification>
                 title                     string of arbitrary length
                 keyword                   string of arbitrary length
                 creator                   string of arbitrary length
                 name                      string of arbitrary length
                 country                   string of arbitrary length
                 date period               no value
                 language                  string of arbitrary length
                 dissemination date        no value
                 dissemination location    string of arbitrary length
                 dissemination source      string of arbitrary length
                 type                      audio, visual, textual
                 theme                     string of arbitrary length
                 format                    characters, frames, minutes, seconds
                 length                    string of arbitrary length
                 minlength                 string of arbitrary length
                 maxlength                 string of arbitrary length


Fig. 1. The structure of the natural language model 



The ‘flags’ Element takes its value from the introduction phrases and the target 
phrases. The ‘TVA-properties’ Element takes its value from the search phrases and 
the summary phrases and the ‘temporal’ Element from the temporal phrases. 

The feature structure can contain one, two or three Elements. These three types of 
feature structure are: 

Type 1: Markers 

- E.g. I want to see what is in my selection list 





A Natural Language Model and a System for Managing TV-Anytime Information 






This utterance consists of an introduction phrase (I want to see what is) and a
target phrase (in my selection list). The Element Type action of the Element markers
takes the value ‘retrieval’ (information that comes from the introduction phrase) and
the Element Type target takes the value ‘list’ (information that comes from the
target phrase).

Type 2: Markers - TVA-properties

- E.g. I would like you to show me movies starring Mel Gibson 

This utterance consists of an introduction phrase (I would like you to show me) 
and a search phrase (movies starring Mel Gibson). The Element Type action of the 
Element markers obtains the value ‘retrieval’ (information that comes from the intro- 
duction phrase), the Element Type genre of the Element TVA-properties obtains the 
value ‘movies’, the Element Type creator takes the value ‘actor’ and the Element 
Type name takes the value ‘Mel Gibson’ (information that comes from the search 
phrase). 

Type 3: Markers - TVA-properties - Temporal 

- E.g. Insert English spoken mystery movies broadcasted at midnight into my selection list.

This utterance consists of an introduction phrase (Insert), a search phrase
(English spoken mystery movies broadcasted), a temporal phrase (at midnight) and a
target phrase (into my selection list). In this case, the Element Type action of the
Element markers takes the value ‘insert’ (information that comes from the
introduction phrase), the Element Type target takes the value ‘list’, from the target
phrase, the Element Type genre of the Element TVA-properties takes the values
‘mystery’ and ‘movies’, the Element Type language takes the value ‘English’ and in
the feature structure there is also the Element Type dissemination date, but without
a value. This information comes from the search phrase. Also, in the Element
temporal, the Element Type time takes the value ‘24’ and the Element Type time
indicator takes the value ‘am’. This information comes from the temporal phrase.

The TV-Anytime metadata model integrates specifications for content metadata
used to describe digital TV Programs in terms of various features and specifications
for user preferences used to filter program metadata. These user preferences are
modelled by the FilteringAndSearchPreferences Descriptor Scheme (DS) and the
BrowsingPreferences DS. The FilteringAndSearchPreferences DS specifies a user’s
filtering and/or searching preferences for audio-visual content. These preferences
can be specified in terms of creation-, classification- and source-related
properties of the content. The FilteringAndSearchPreferences DS is a container of
CreationPreferences (e.g. Title, Creator), ClassificationPreferences (e.g. Country,
Language) and SourcePreferences (e.g. DisseminationSource, DisseminationLocation).
The BrowsingPreferences DS is used to specify a user’s preferences for navigating and
accessing multimedia content and is a container of SummaryPreferences (e.g.
SummaryType, SummaryTheme, SummaryDuration) and PreferenceCondition (e.g. Time,
Place).

For the retrieval of the personalized content metadata, the management of the
personalized content and the creation of the user’s profile, the utterance contains
in its body one or more search phrases. The system will create a TV-Anytime XML
document, compatible with the UserIdentifier and the FilteringAndSearchPreferences
Descriptor Schemes of the TV-Anytime metadata specification. For the fourth system
function, the definition of the user’s preferences for the characteristics of an
audio-visual content summary, the system constructs a TV-Anytime XML document,
compatible with the UserIdentifier and the BrowsingPreferences Descriptor Schemes,
with values in the fields of the SummaryPreferences and the PreferenceCondition (for
handling the time of the summary delivery).
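
To make the mapping from search phrases to such a document more concrete, the
following sketch builds a simplified FilteringAndSearchPreferences document with
Python's standard `xml.etree.ElementTree` module. It is an assumption-laden
illustration, not the system's implementation: the root element name
`UserPreferences`, the function name `build_filtering_preferences`, the mapping table
and the omission of XML namespaces are all choices made here, and the output is not
claimed to validate against the TV-Anytime schema.

```python
import xml.etree.ElementTree as ET


def build_filtering_preferences(user_id, tva_properties):
    """Build a simplified FilteringAndSearchPreferences document.

    Element names mirror those mentioned in the text; namespaces and the exact
    nesting required by the TV-Anytime schema are omitted (assumption).
    """
    root = ET.Element("UserPreferences")  # assumed root element
    ET.SubElement(root, "UserIdentifier").text = user_id
    fasp = ET.SubElement(root, "FilteringAndSearchPreferences")
    creation = ET.SubElement(fasp, "CreationPreferences")
    classification = ET.SubElement(fasp, "ClassificationPreferences")
    source = ET.SubElement(fasp, "SourcePreferences")

    # Hypothetical mapping from Element Types of the model to preference containers.
    containers = {
        "title": (creation, "Title"),
        "creator": (creation, "Creator"),
        "country": (classification, "Country"),
        "language": (classification, "Language"),
        "genre": (classification, "Genre"),
        "dissemination source": (source, "DisseminationSource"),
        "dissemination location": (source, "DisseminationLocation"),
    }
    for element_type, value in tva_properties.items():
        if element_type not in containers:
            continue  # Element Types not covered by this sketch are skipped
        parent, tag = containers[element_type]
        values = value if isinstance(value, list) else [value]
        for v in values:
            ET.SubElement(parent, tag).text = v
    return root


doc = build_filtering_preferences(
    "user-1", {"genre": ["mystery", "movies"], "language": "English"}
)
xml_text = ET.tostring(doc, encoding="unicode")
```

Here `xml_text` holds the serialized document, with two `Genre` elements and one
`Language` element inside `ClassificationPreferences`.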

A user’s selection list is a list of program metadata that the system recommends to
the user based on his preferences, expressed either in his TV-Anytime profile or
directly by him. Every program in this list has a status. The four possible values of
this status are: undefined, toBeRecorded, recorded, toBeDeleted. If the user wants to
manage the contents of his selection list or the list of his stored contents, the
actions that take place are represented in figure 2:



[Figure 2: state diagram, not reproduced. Labels recoverable from the original:
states Null, Undefined, ToBeRecorded and Recorded; operations
InsertIntoUserSelectionList, UpdateUserSelectionList and DeleteFromUserSelectionList;
transitions triggered by the User and/or the System.]


Fig. 2. The state machine for managing the status of a program in the user’s selection list 
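
The idea of a per-program status managed through insert, update and delete operations
can be sketched as a small state machine. The four status values come from the text;
everything else is an assumption made for illustration, in particular the transition
table `ALLOWED` (it is not taken from the figure) and the class and method names.

```python
from enum import Enum


class Status(Enum):
    UNDEFINED = "undefined"
    TO_BE_RECORDED = "toBeRecorded"
    RECORDED = "recorded"
    TO_BE_DELETED = "toBeDeleted"


# ASSUMPTION: an illustrative transition table, not the edges of the original figure.
ALLOWED = {
    Status.UNDEFINED: {Status.TO_BE_RECORDED},
    Status.TO_BE_RECORDED: {Status.UNDEFINED, Status.RECORDED},
    Status.RECORDED: {Status.TO_BE_DELETED},
    Status.TO_BE_DELETED: set(),
}


class SelectionList:
    """Hypothetical selection list mapping program ids to a Status."""

    def __init__(self):
        self._status = {}

    def insert(self, program_id):
        # New entries start as 'undefined'; re-inserting keeps the current status.
        self._status.setdefault(program_id, Status.UNDEFINED)

    def update(self, program_id, new_status):
        current = self._status[program_id]  # KeyError if the program is not listed
        if new_status not in ALLOWED[current]:
            raise ValueError(f"{current.value} -> {new_status.value} is not allowed")
        self._status[program_id] = new_status

    def delete(self, program_id):
        self._status.pop(program_id, None)

    def status(self, program_id):
        # None plays the role of the 'Null' state (program not in the list).
        return self._status.get(program_id)
```

A table-driven design keeps the permitted transitions in one place, so the table can
be replaced by the actual edges of the state machine without touching the class.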



3 Resolving Ambiguities - The Algorithm 

From the user’s utterance the system retrieves the TV-Anytime category he is
referring to. If the utterance gives no such indication, the system tries to resolve
the ambiguity by following a specific sequence of steps. First, it collects word by
word the sub-phrase with the ambiguities and creates a table of these words. Then, by
using a stop list containing words with no semantic value (such as prepositions,
pronouns, conjunctions, particles, articles, determiners and so on), it eliminates
the words that match any word in this list. However, the system retains the existence
of an ‘and’ or an ‘or’ for the optimum cross-correlation of the results. Then the
system gathers the remaining words and, by filtering them through its database or
other existing ontologies, it returns a TV-Anytime XML document that is compatible
with the FilteringAndSearchPreferences DS. This descriptor scheme is the one that is
used for the semantic resolving of the words. Finally, the system checks for matches
in any existing TV-Anytime user’s profile. The classification of the results is
important in order to prioritize the user’s preferences.
Algorithm for Ambiguities Resolution

Subroutine: semantic resolver

    for every word with ambiguity
        check the stop list
        if there is no match
            check for TVA semantics
            if there are TVA semantics
                add a specific weight value
                if there is a user profile
                    check for semantics from profile
                    if there is a match
                        add a weight value based on the preference value
                        from profile to the TVA semantics
        if there is a match
            cut the word
    return

check the words with ambiguities for an ‘and’ or an ‘or’
if there is no match
    call semantic resolver
    check for same strings that contain words with the same semantic
    cut the rest
    search for results
    rank the results based on the same TVA semantic
else
    call semantic resolver
    search for results
    group the results based on the same TVA semantic
    rank the results based on the weight value
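
The semantic resolver and the weight-based ranking can be sketched in a few lines of
Python. This is a simplified illustration under stated assumptions, not the system's
code: the stop list, the lookup table `TVA_SEMANTICS` (standing in for the database
and ontologies), the profile format, the constant `BASE_WEIGHT` and both function
names are hypothetical, and the separate handling of ‘and’/‘or’ is left out.

```python
# Illustrative data only; the real system consults its database and ontologies.
STOP_LIST = {"i", "want", "with", "the", "a", "from", "of", "in"}
TVA_SEMANTICS = {
    "comedies": ["genre"],
    "tom": ["name"],
    "english": ["language", "name"],
}
BASE_WEIGHT = 1.0  # assumed weight for any word that has TVA semantics


def semantic_resolver(words, profile=None):
    """Return {word: {category: weight}} for the words that survive the stop list."""
    resolved = {}
    for word in words:
        w = word.lower()
        if w in STOP_LIST:
            continue  # the word matches the stop list, so it is cut
        weights = {}
        for category in TVA_SEMANTICS.get(w, []):
            weight = BASE_WEIGHT
            if profile is not None:
                # Add the preference value when the profile mentions this pair.
                weight += profile.get((category, w), 0.0)
            weights[category] = weight
        resolved[w] = weights
    return resolved


def rank_interpretations(resolved):
    """Flatten to (word, category, weight) triples, highest weight first."""
    triples = [(w, c, wt) for w, cats in resolved.items() for c, wt in cats.items()]
    return sorted(triples, key=lambda t: -t[2])


# Hypothetical profile: (category, word) -> preference value.
profile = {("language", "english"): 0.5}
resolved = semantic_resolver("comedies with Tom English".split(), profile)
ranking = rank_interpretations(resolved)
```

With this made-up profile, the interpretation of ‘English’ as a language is ranked
first, because its weight is raised by the profile's preference value, while ‘with’
is removed by the stop list.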



4 System Architecture 

In figure 3 we present the overall architecture of the natural language system. 






Fig. 3. The Natural Language System Architecture 
The utterance is entered through a wireless device (mobile phone, PDA). Then it is
forwarded to the ChartParser module. The ChartParser module consists of the JavaChart
parser [13], which creates a feature structure with the information from the user’s
input. There are two lexicons: the stem lexicon, which contains the stems of the
words used in this language and the words that help to give the right values to the
TV-Anytime categories, and the mini lexicon, which contains the endings of the words.
Finally, there is the grammar, which follows a unification-based formalism.

The Dialogue Manager acts as the core module of the system. It is responsible for
communicating with all the other system modules, for restructuring the feature
structure so that the other modules can extract the information, for dealing with
ambiguities by communicating with the Ambiguities Resolver module, for interacting
with the server to retrieve the results, and for passing the information to the
Response Manager module for the construction of a proper message for the user. It
takes as input a list of feature structures from the chart parser and the user’s data
from the application that he/she is using. It checks for specific properties in order
to prune the list of the feature structures and, in case there are any ambiguities,
it passes to the Ambiguities Resolver module the list of the words with the
ambiguities. Finally, it creates a structure with information about the action, the
target and the TV-Anytime XML document from the user’s input.

The content management system architecture follows a multi-tier approach and
consists of three tiers. The lowest tier handles the metadata management. The
middleware tier includes all the logic for interfacing the system with the outside
world. The application tier enables the exchange of information between the server
and heterogeneous clients through different communication links. The Relational
Database of the system contains the TV-Anytime Metadata information, as well as a
number of ontologies, with information concerning digital TV applications. The
Relational DBMS manages the transactions, utilizes a Java API (implemented for the
extraction of the functionality for filtering, retrieval and summarization) and
cooperates with the XML-DB middleware. The XML-DB middleware is a set of software
components responsible for the management of the TV-Anytime XML documents and the
correspondence of the TV-Anytime Metadata XML schema with the underlying relational
schema.

The Ambiguities Resolver module consists of three modules that are responsible
for the resolution of different kinds of ambiguities. The Date/Time Resolver is the
component that converts the temporal phrases into a TV-Anytime compliant form. The
TVA Semantics Resolver communicates with the relational DBMS and is responsible
for attaching TV-Anytime semantics to the words with the ambiguities. We use the
User’s Profile Resolver to help the ranking of the final results. Every user can have
one or more User Profiles, which represent his interests. The module filters the list
of the words through any existing user’s profile and returns a
FilteringAndSearchPreferences XML document with values from the corresponding
TV-Anytime categories. Finally, it passes this document to the Response Manager
module.

The Response Manager module interacts with the system's database by providing
it with the structured information, executes the appropriate functions, retrieves the
results and classifies them accordingly. Then, it creates a message and adds it to
any existing result list. From the message the user must be able to understand what
exactly the system did to satisfy his request, or receive an error message if
something went wrong.
5 Implementation Environment

The implementation platform consists of the MySQL Platform [15]. The implementation
of the server was based on Java 2 and the class files were compiled using
JDK 1.4.1 [16]. The parsing of the user’s utterance was done with the JavaChart
parser, a chart parser written in Java. For the implementation of the communication
between the server and the client, we have exploited the Java Servlet technology on
the server side by developing a servlet that acts as the interface between the user
client and the database server or the PDR interface. This servlet was deployed
locally for testing on an Apache Tomcat v4.0.1 server, and its class files were also
compiled using JDK 1.4.1.

Two cases are considered for the wireless device used on the system’s client
side. The first one is the application that runs on any Java-enabled mobile phone
device and the second is the application that runs on a PDA device. In both cases of
the remote access application, the client establishes an HTTP connection with the
server.

For the client (mobile device) the implementation platform consists of the Java 2
Platform, Micro Edition (J2ME) [16]. The Connected Limited Device Configuration
(CLDC) was used because of the limited capabilities of the mobile phone. The Mobile
Information Device Profile (MIDP) is an architecture and a set of Java libraries that
create an open, third party application development environment for small,
resource-constrained devices.

For the second implementation of the client side we used the JEODE Java Runtime
Environment [16] for the PDA device client. The JEODE runtime environment supports
the CDC/Foundation Profile and the Personal Profile J2ME specifications that support
implementations in PersonalJava and Embedded Java.



6 Evaluation 

This section presents some preliminary results of the system’s evaluation. The
evaluation was based on a user experiment that was performed by ten laboratory
users. All the users had previous experience using computers and graphical interfaces.
Nevertheless, none of them had ever used a natural language system before. There
were three main tasks used in the evaluation. The first one was for the users to use
the system to define their user profile, the second one to interact with the system
by using a set of utterances with no ambiguities and the third one to interact with
the system by using a set of 10 utterances with ambiguities.

The functionality provided by the natural language system was also provided by 
the use of alternative menu-driven user interfaces for wireless devices (mobile phone, 
PDA). Based on the preliminary results it becomes clear that the end users found the 
system easier to use with the NLI than with the traditional user interfaces. The NLI 
was shown to provide an easy way to specify TV-Anytime structures without com- 
plex navigation between screens. For utterances that showed no ambiguities the sys- 
tem proved to fully exploit the structured TV-Anytime model capabilities for filtering 
in order to retrieve the qualified ranked results. 







A. Karanastasi, F.G. Kazasis, and S. Christodoulakis 



In the case of the utterances with ambiguities, we have considered 100 interactions
in total. All the utterances contained a sub-phrase with no declaration of any
TV-Anytime categories. For example, two of the utterances considered were:

• I want comedies with Tom English 

• Record movie from Russia with love 

In order to evaluate our approach of using the existing user profiles to rank the
possible answers in case of ambiguities, we have considered different types of
user profiles, chosen so that the similarity between the user’s preferences in these
profiles and the specific input utterances is near 0.5.

Diagram 1 shows the number of interactions per rank position for the cases where the
system either has or has not used the user profile in order to better rank the
results. When the system uses the user profile to rank the results, we get about 90%
of the exact results in the first 20 positions of the ranked result lists. This
percentage varies according to the result list. In the first 5 positions we get 65%
of the exact results. When the system does not use the user profile to rank the
results, we get about 80% of the exact results in the first 20 positions of the
ranked result lists. In the first 5 positions we get 40% of the exact results.

Diagram 2 shows the number of interactions in the top percentage of the total
number of results. When the system uses the user profile to rank the results, we get
about 90% of the exact results in the top 40% of the total number of results. In the
case that the system does not use the user profile, we get about 90% of the exact
results in the top 70% of the total number of results.
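
The percentages reported above are of the form "share of interactions whose exact
result appears within the first k positions". A minimal sketch of that computation is
shown below; the function name `fraction_within_top_k` and the sample ranks are
hypothetical and are not data from the experiment.

```python
def fraction_within_top_k(ranks, k):
    """Share of interactions whose exact result is ranked at position k or better.

    ranks: 1-based rank position of the exact result for each interaction.
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Made-up rank positions for five interactions (illustration only).
sample_ranks = [1, 3, 7, 12, 25]
top5 = fraction_within_top_k(sample_ranks, 5)    # 2 of 5 interactions
top20 = fraction_within_top_k(sample_ranks, 20)  # 4 of 5 interactions
```

Computing the same measure with and without profile-based ranking, for several values
of k, gives the kind of comparison summarized in the two diagrams.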

More systematic and larger scale evaluation work is still under way. 





Diagram 1. Number of interactions per rank position

Diagram 2. Number of interactions per top percentage of the total number of results



7 Summary - Conclusions 

In this paper we described the design and the implementation of a natural language
model for managing TV-Anytime information (program metadata, user profile metadata)
that is stored in databases in home TV-Anytime boxes or in last mile TV-Anytime
servers. The NLIs allow users to manage TV-Anytime metadata from PDAs and mobile
phones in order to determine the rules of management of digital TV data (programs
and metadata), retrieve TV program content based on any information of its metadata
description, express their preferences for the types of TV programs that will be
stored, manage their selection list (i.e. programs that have been selected by the
PDR or the user himself as candidates for recording) or modify any of the above.

The natural language model was developed to be compatible with the TV-Anytime 
specifications and manages the information of the content metadata categories for TV 
programs and the user’s profile metadata. It is expandable to future additions in the 
TV-Anytime metadata specifications. 

In order to provide the proposed functionality, a dialogue model was developed
that contains: Introduction phrases, to define the functionality; Search phrases, to
define the TV-Anytime information; Target phrases, to define where each of the
functions is targeting; Temporal phrases, to define phrases about date and time;
and Summary phrases, to define summaries with audio/visual content.

The structures defined for the metadata by the TV-Anytime standard significantly
limit the possibility of ambiguities in the language. Thus most queries are
answered precisely from the underlying database system. Whenever ambiguities
occur, the system first checks for specific words in the utterance, then searches
existing ontologies and attaches semantics to every ambiguous word, and finally
checks the TV-Anytime user’s profile to attach a weight value to the search
results. The algorithm checks for the best cross-correlations and unifies the results
by assigning the regular weight values to the results.

The implementation of the natural language model runs on a mobile phone and a
PDA. Preliminary evaluation studies have shown it to be a good tool for these
environments, better than traditional PC interfaces. Larger scale experimentation
is currently underway. In addition, we are working on the integration and
interaction with domain-specific ontologies related to TV programs [17], [18]. Our
current research aims to show that natural language (and speech) interfaces are
appropriate interface styles for accessing audiovisual content, stored in home
information servers, from mobile devices.

Acknowledgments. The work presented in this paper was partially funded in the
scope of the DELOS II Network of Excellence in Digital Libraries [19].

Abstract. Internet information services are usually based on HTML, but more
and more phone speech information services are now appearing. In February 2003
VoiceXML 2.0 was published by the VoiceXML Forum to bring the advantages of
web-based development and content delivery to interactive voice response
information services. Such services include content generation, navigation and
functionality support, which have to be modeled in a consistent way. This
document describes a SiteLang oriented specification approach based on media
objects as well as movie concepts for story and interaction spaces. This results in
a systematic simplification of the speech application development process. We
have created a VoiceXML based information system prototype for an
E-Government system, which will be used as an example within this paper.



1 Introduction 

Voice response information services differ from GUI based web services because 
they use natural speech technology and DTMF input instead of a computer with a 
web browser. In a normal man-machine conversation, the partners change their roles 
between speaking and listening to perform a dialog. 



1.1 VoiceXML Based Information Services

VoiceXML (VXML) is designed for creating audio dialogs that feature synthesized
speech, digitized audio, recognition of spoken words and DTMF key input,
recording of spoken input, telephony and mixed initiative conversations. VXML is
a standard dialog design language that developers can use to build voice
applications. As the dialog manager component it defines dialog constructs like
form, menu and link, and the Form Interpretation Algorithm mechanism by which
they are interpreted. A caller uses DTMF or speech as system input and gets
synthetic speech or pre-recorded audio as system output. Dialogs are added
through the design of new VoiceXML documents, which can be generated by a web
application server with a database connection. The architectural model assumed
by this document has the following components:




Fig. 1. VXML Architectural Model 



A document server processes requests from a client application, the VXML Inter- 
preter, through the VXML interpreter context. The server produces VXML docu- 
ments in reply, which are processed by the VXML Interpreter. The VXML interpreter 
context may monitor user inputs in parallel with the VXML interpreter. For example, 
one VoiceXML interpreter context may always listen for a special escape phrase that 
takes the user to a high-level personal assistant, and another may listen for escape 
phrases that alter user preferences like volume or text-to-speech characteristics. The 
implementation platform is controlled by the VXML interpreter context and by the 
VXML interpreter. For instance, in an interactive voice response application, the 
VXML interpreter context may be responsible for detecting an incoming call, ac- 
quiring the initial VXML document, and answering the call, while the VXML inter- 
preter conducts the dialog after answer. The implementation platform generates 
events in response to user actions and system events. Some of these events are acted 
upon by the VXML interpreter itself, as specified by the VXML document, while 
others are acted upon by the VXML interpreter context. 



1.2 Problems with Voice Based Man-Machine Communication 

Although useful speech technology exists and is already in operation, there are
also drawbacks and problems:







State- and Object Oriented Specification of Interactive VoiceXML Information Services 






• Finances: telecommunication providers, telephony systems, speech recognition
  components, VXML browsers and TTS voices cost money
• Naturalness: some users do not want to talk to a computer
• Reliability: speech recognition components are not as predictable or as good as
  a human in a call center
• Development Process: systematic approaches for voice application specification
  and documentation are only at an early stage
• Short dialogs and patience: questions and functionality should be as short and
  precise as possible because it takes time to listen to the system
• Navigation: the caller has to navigate to the desired functionality and data
• Long dialogs: some users want to get a full information packet read by the
  system instead of navigating
• VXML dialogs contain only the data and sentences they are programmed for, so
  the conversation is limited

We use standard methods (object oriented, entity relationship), standard software
(Java, MySQL, Apache, Tomcat) and systems (VoiceXML 2.0, OptimTalk), and we
have developed a voice application system which is free and easy to use. Through
the use of DTMF as primary input and speech recognition as secondary input, any
voice application is usable in many environments.



1.3 State- and Object Oriented Approach 

Many approaches for the conceptual modeling of internet sites have been
summarized and generalized in [ScT00]. This and other approaches can be adapted
to the development of voice applications. Workflow approaches currently try to
cover the entire behavior of systems. Wegner's interaction machines [GST00] can
be used for formal treatment of information services. Semantics of information
services can be based on abstract state machines, which enable reasoning on
reliability, consistency and liveness [Tha00].

UML (Unified Modelling Language) [For00] is a standard notation for the modeling
of real-world objects as a first step in developing an object-oriented design
methodology. Its notation is derived from object oriented specifications and
unifies the notations of object-oriented design and analysis methodologies. After
the three gurus (Grady Booch, Ivar Jacobson and James Rumbaugh) finished
creating UML as a single complete notation for describing object models, they
turned their efforts to the development process. They came up with the Rational
Unified Process (RUP), which is a general framework that can be used to describe
specific development processes. We use the following UML concepts according to
the RUP to specify our software application: use case, state, object, class and
package.

HTML [W3HTML4] is the universally understood web language; it is similar to
VXML and gives authors the means to:

• Publish online documents with text
• Retrieve online information via hyperlinks, at the click of a button
• Design forms for conducting transactions with remote services

We use HTML for the textual presentation and definition of the dialogs. As a result 
the development of voice applications can be reduced to graphical web development 
under consideration of voice aspects. 

The software development starts with the specification of the story, processes and
scenarios through requirement specification, process description and use cases.
The software developer is free to use other modeling methods. Our use cases are
based on the story description and the process description. With the help of these
use cases we define finite state machines (FSMs) which represent the dialog and
information flow. These FSMs are the basis for the real spoken dialog, the
navigation and the behavior of the application. Based on the dialog flow of the
FSMs, HTML pages are specified. As the last step of the specification, media
objects are defined and integrated into the HTML pages. Finally, the HTML pages
are translated into VXML pages.

In addition, we explain the specification using our speech application prototype
SeSAM, which was designed to support communal E-Government purposes.
SeSAM supports the management of users, groups, messages and appointments.



2 Voice Application Specification 

The implementation of voice services can be rather complex, so this specification
approach aims to simplify the software development process of a voice application
project. The voice software development process starts with the requirement
analysis to capture the real customer needs. On the basis of the intuitive main
objectives, and taking the user into consideration, we develop structured use
cases, finite state machines, a dialog specification and media objects.



2.1 Story Specification 

A story describes the set of all possible intended interactions of users with the system. 
It is often defined generally in a product specification document. Additionally the 
story specification could be supported by a business process specification which de- 
scribes the processes which should be supported by the software system. 

A brief description of our E-Government SeSAM application as a story space could 
be the following: 

Story {
  Visitor {Group Information, Public Messages, Public Political Appointments}
  RegisteredUser {Group Information, Messages, Appointments}
}

The system is used by registered users, like party members or politicians, and by
visitors who represent the inhabitants of a town. The users, or callers, will get
information about members and groups, messages and appointments by phone.
The proposed story specification method is not rigid, so other forms could be used.



2.2 Process Specification 

A process could be anything that operates for a period of time, normally consuming 
resources during that time and using them to create a useful result. 

A process model is a general approach for organizing work into activities, an aid
to thinking and not a rigid prescription of the way to do things. It helps the
software project manager to decide what work should be done in the target
environment and in what sequence the work is performed. A process specification
is useful for systems where voice interfaces are intended to support the processes
or goals of the organization. For example, an employee who answers customer
questions by phone could be replaced by an automatic voice system which offers
the desired information through structured FAQs. Process specification costs time
for analysis and design, but it provides good documentation and a useful basis for
further requirement specification.








2.3 Use Case Specification 

Based on intuitive goals, a story and a business process description, we model the
requirements of the application with use cases known from the Unified Modelling
Language (UML).

A use case describes a set of actions of a system, observed by the user, which
produces some result. In our case it describes the set of scenarios and is started
by an actor. A scenario is a specific set of actions that clarifies the behavior of the
voice software system and represents the basis for abstraction and specification.
An actor represents one role outside of a system. Graphically, a use case is
represented by an ellipse and is connected with an actor. It can be used by other
use cases or can use other ones.

We use different levels of abstraction for use case modeling. There are further use
case diagrams for the actions coordinate, communicate and get member
information, which clarify and describe in general terms the interaction between
the user and the system.



2.4 Dialog Structure Specification 

The result of the story definition and requirement analysis is a specification of the
typical interaction of a user with the computer voice system. But this
documentation contains no information about navigation and the dialog structure
of the VXML pages which contain the functionality and data. Following the nature
of a conversation, each VXML page contains the data spoken by the TTS system
and data about the reaction if the caller says something. DTMF input for the
selection of functionality or data is limited to the digits 0...9 and the special
characters * and #.

We model navigation and the structure of the whole story and of each individual
use case with finite state machines $M = (F, A, Q)$, where:

1. $Q$ is a finite set of conversation states

2. $A$ is a finite set of user input symbols (user inputs like DTMF, speech or clicks)

3. $F$ is a function of the form $F : Q \times A \to Q$

Every state of the FSM is regarded as a conversational state. In a conversational
state the system says something, asks the user for an answer and goes to another
conversational state, depending on the user input. A transition from one state to
another defines the user input (DTMF, recognized speech command) accepted in a
certain state and the resulting change into another conversational state.
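
The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the state names follow the communication use case of this section, whereas the concrete DTMF digits and the helper names `step` and `run` are invented for the example and are not taken from SeSAM.

```python
# Minimal sketch of a dialog FSM M = (F, A, Q); all DTMF digits are assumed.
from typing import Dict, Tuple

# Q: conversational states (names follow the communication use case)
STATES = {"start", "services", "messages", "messageList", "message"}

# A: user input symbols, here the DTMF alphabet 0-9, * and #
INPUTS = set("0123456789*#")

# F: Q x A -> Q, stored as a partial transition table
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("start", "1"): "services",
    ("services", "2"): "messages",
    ("messages", "1"): "messageList",
    ("messageList", "1"): "message",
    ("messages", "0"): "services",
}


def step(state: str, symbol: str) -> str:
    """Return the next conversational state; stay in place on unmapped input."""
    if state not in STATES or symbol not in INPUTS:
        raise ValueError(f"invalid state or input: {state!r}, {symbol!r}")
    return TRANSITIONS.get((state, symbol), state)


def run(start: str, keys: str) -> str:
    """Feed a sequence of DTMF keys to the FSM and return the final state."""
    state = start
    for key in keys:
        state = step(state, key)
    return state
```

Staying in the current state on an unmapped key is a design choice of this sketch; a real voice application would more likely re-prompt the caller.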

Figure 3 shows the dialog structure specification for the use case communication
of our SeSAM application. The system begins by reading the start dialog at the
start page and asks the caller what to do. The caller can decide, by speaking or by
using DTMF, to go to services, where all information services are read. From the
service overview the caller can go to message services, select my messages there,
get a message list and then select the desired message, which is read aloud.

At the end of this phase you should have a generic FSM for the whole story and a
special FSM for every use case. All FSMs should be integrated into one FSM. The
dialog structure specification as an FSM defines the navigation structure of the
voice application, but it contains no data about spoken sentences. This is added by
a dialog step specification in HTML.

Fig. 3. Dialog Structure Specification through FSMs



2.5 Dialog Step Specification in HTML 

The FSM specification is now extended with the spoken sentences and prompts
through the HTML dialog step specification. An HTML page contains
conversational data, prompts and answers for one conversational state and links
to other conversational states. We map every state of an FSM to one HTML page.
The next listing shows a dialog step definition of the state services from Fig. 3.

<html><head><title> Information Services </title></head>
<body>
<p> You have selected information services. </p>
<p> Please select now your desired service: </p>
<ol>
<li><a href="member.html"> Member information </a></li>
<li><a href="messages.html"> Messages </a></li>
<li><a href="appointments.html"> Appointments </a></li>
</ol>
</body></html>

Every transition of the FSM (a user input which results in a new state) is mapped
to a hyperlink to another page. The developer is free to use graphical mouse
hyperlinks or JavaScript user input functions. On each page the machine says
something and expects input from the user to go to another page. It is important
to keep in mind that the specification is the basis for the voice application, so text
and menus should be as short and objective as possible. Any dynamic information,
e.g. from a database, is modeled with static example data.

At the end of the dialog step definition we have an HTML prototype which behaves
similarly to the VXML application. One advantage of this specification is the
graphical definition of the dialogs, which software developers are used to. In
addition, you get a prototype as a set of HTML files which can be simulated to
analyze the dialog and to improve the conversation.
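
The mapping from one conversational state to one static HTML page can be sketched as follows. The function `render_state` and its parameters are hypothetical names introduced for this illustration; the sample values are the prompts and links of the state services.

```python
from html import escape
from typing import List, Tuple


def render_state(title: str,
                 prompts: List[str],
                 links: List[Tuple[str, str]]) -> str:
    """Render one conversational state as a static HTML page.

    `links` holds one (target_page, label) pair per FSM transition
    leaving this state.
    """
    lines = [f"<html><head><title>{escape(title)}</title></head>", "<body>"]
    lines += [f"<p>{escape(p)}</p>" for p in prompts]
    lines.append("<ol>")
    for href, label in links:
        # Each transition becomes exactly one hyperlink.
        lines.append(
            f'<li><a href="{escape(href, quote=True)}">{escape(label)}</a></li>'
        )
    lines += ["</ol>", "</body></html>"]
    return "\n".join(lines)


page = render_state(
    "Information Services",
    ["You have selected information services.",
     "Please select now your desired service:"],
    [("member.html", "Member information"),
     ("messages.html", "Messages"),
     ("appointments.html", "Appointments")],
)
```

Generating the pages from a table of states and transitions keeps the HTML prototype consistent with the FSM, which is the property the later translation to VXML relies on.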



2.6 Media Object Specification 

The static graphical HTML-prototype of the voice application contains navigation 
structure and the 'adornment' of the scenes. It is filled with sample data and contains 
no dynamic information, e.g. from databases. 

Now we design the media objects of our application to add dynamic functionality to 
our prototype. Generally, a media object is an interesting object for the application, 
e.g. messages, appointments, groups or users. From the object oriented point of view 
a media object is the member of a class with certain attributes and methods. From the 
database view a media object could be a view on a database. 

We used the following approach to design our media objects: 

1. Find nouns which are used in the story, use cases, FSM and the HTML pages

2. Identify methods which are used in the HTML pages and create object names

3. Categorize the nouns and identify generalizations

4. Remove nouns and concepts which do not name individual concepts

5. Choose short class names for media objects and document each one briefly

6. Identify attributes, associations, relations ...

In our application the use case communication uses the media object message.
This media object is used in almost all states (and therefore almost all HTML
pages) for communication. In the state messageList the media object message
provides a function like getMessageList(String messageType), through which a
certain number of messages is displayed in HTML or read in VXML.
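
A media object of this kind could look roughly like the following sketch. It is not the SeSAM implementation: the class and field names are hypothetical, an in-memory list stands in for the database view, and the limit of nine entries (one per DTMF key 1 to 9) is an assumption of the example.

```python
from dataclasses import dataclass
from html import escape
from typing import List


@dataclass
class Message:
    """One stored message; the fields are assumed for illustration."""
    msg_id: int
    msg_type: str   # e.g. "public" or "my"
    subject: str


class MessageObject:
    """Hypothetical media object wrapping a message store."""

    def __init__(self, messages: List[Message], max_items: int = 9):
        self._messages = messages
        self._max_items = max_items  # assumed: one entry per DTMF key 1-9

    def get_message_list(self, message_type: str) -> str:
        """Return the messages of one type as HTML list items."""
        selected = [m for m in self._messages if m.msg_type == message_type]
        items = [
            f'<li><a href="message.html?id={m.msg_id}">{escape(m.subject)}</a></li>'
            for m in selected[: self._max_items]
        ]
        return "\n".join(items)
```

A VXML variant of the same object would keep the selection logic and only replace the output format, which is why the formatting is confined to one method here.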



2.7 Integration of HTML and Media Objects 

The speech application development process is hard to keep track of; therefore we
first develop a dynamic HTML voice prototype from our static HTML prototype,
which has the 'same' behavior and functionality as the VXML voice application.
The integration of HTML and media objects is done by inserting the media objects
with their functionality into the HTML pages via special tags. At this point you
have to choose a programming language and implement the functions of your
class. The following listing shows HTML code within a Java Server Page which is
enriched by the media object message.

<jsp:useBean id="message" class="SeSAM.HTML.Message" />
<% message.setSession(session); %>
<html>
<head><title> Message list </title></head>
<body>
<p> You are in the <%= message.type %> overview </p>
<p> Please select now your desired message </p>
<ol>
<% if (message.type.equals("public") || message.type.equals("my"))
     out.println(message.getMessageList()); %>
<li><a href="messages.html"> back to messages </a></li>
<li><a href="services.html"> back to services </a></li>
</ol>
</body></html>



The listing shows HTML code with JSP scripting elements. Other object oriented 
programming languages like PHP or C++ could also be used for implementation. 

At the end of this phase we have a dynamic graphical HTML prototype which has
the same dialog flow and functionality as the VXML application and which uses
the same data from the database. The HTML web application then has to be
simulated and reviewed with conversational and speech aspects in mind.



2.8 HTML to VXML Translation 

In this step each HTML page is translated to a VXML page. The VXML page can
use the same media objects, but some output functions have to be modified or
added to get VXML output instead of HTML output. This process can be
automated by software tools.

The result of this translation is a ready VXML application which can easily be run
on a normal PC or a VXML based telecommunication system. Future work aims to
automate this translation process; a specification in XML for the automatic
generation of HTML and VXML could decrease the development time.
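
How such an automatic translation might work for simple dialog step pages is sketched below. This is an illustration under several assumptions, not the tool used for SeSAM: every page consists only of paragraphs and a list of links, the last paragraph is the question, the links are mapped to DTMF keys 1 to 9 in page order, and each translated page is assumed to carry the same name with the extension .vxml. The names `DialogPageParser` and `html_to_vxml` are hypothetical.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr


class DialogPageParser(HTMLParser):
    """Collect <p> texts and <a href> links from a dialog step page."""

    def __init__(self):
        super().__init__()
        self.prompts, self.links = [], []
        self._tag, self._href, self._text = None, None, []

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "a"):
            self._tag, self._text = tag, []
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._tag:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = " ".join("".join(self._text).split())  # normalize whitespace
        if tag == "p":
            self.prompts.append(text)
        elif self._href:
            self.links.append((self._href, text))
        self._tag = None


def html_to_vxml(html: str, field_name: str = "choice") -> str:
    """Translate one dialog step page into a VoiceXML 2.0 form (sketch)."""
    parser = DialogPageParser()
    parser.feed(html)
    *statements, question = parser.prompts or [""]
    out = ['<?xml version="1.0"?>', '<vxml version="2.0">', "<form>"]
    out += [f"<block>{escape(s)}</block>" for s in statements]
    out.append(f"<field name={quoteattr(field_name)}>")
    out.append(f"<prompt>{escape(question)} <enumerate/></prompt>")
    # DTMF keys 1-9 select the links in page order; further links are dropped.
    for digit, (href, label) in enumerate(parser.links[:9], start=1):
        target = href.rsplit(".", 1)[0] + ".vxml"  # assumed naming scheme
        out.append(
            f'<option dtmf="{digit}" value={quoteattr(target)}>'
            f"{escape(label)}</option>"
        )
    # The selected option value is the URI of the next document.
    out.append(f"<filled><goto expr={quoteattr(field_name)}/></filled>")
    out += ["</field>", "</form>", "</vxml>"]
    return "\n".join(out)


sample = """<html><head><title> Information Services </title></head>
<body>
<p> You have selected information services. </p>
<p> Please select now your desired service: </p>
<ol>
<li><a href="member.html"> Member information </a></li>
<li><a href="messages.html"> Messages </a></li>
</ol>
</body></html>"""

vxml = html_to_vxml(sample)
```

Pages with dynamic content would additionally need the VXML output functions of the media objects, which this static sketch does not cover.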

The following listing shows a typical SeSAM E-Government VXML document
which is similar to a typical HTML page and which uses the dynamic object
member. Note that the whole application can also be tested without media object
tags.

<jsp:useBean id="member" class="SeSAM.VXML.Member" />
<% member.setSession(session); %>
<?xml version="1.0"?>
<vxml version="2.0">
<form>
<block>You chose member information services</block>
<field name="choiceMember">
<prompt> Please choose the desired group or other services <enumerate/></prompt>
<option dtmf="<%= member.getDTMFcounter() %>" value="services"> back to services </option>
<%= member.getTopGroupsOptions() %>
<filled>
<%= member.getTopGroupsFilledIfs() %>
<if cond="choiceMember=='services'">
<goto next="../servlet/services"/>
</if>
</filled>
</field>
</form>
</vxml>



2.9 Application Simulation 

Simulating and running the application is important to validate and check the
behavior, the data and the functionality of the system.

We suggest the following points to perform rapid development and validations: 

1. Specify and test FSMs for navigation and input validation

2. Specify and test a static HTML dialog for a use case

3. Specify and test a static VXML dialog for the use case

4. Add dynamic media objects to HTML files and test them

5. Add dynamic media objects to VXML files and test them in a web browser with
XML support and XML validation (Mozilla, IE)

6. Test each VXML file in your voice browser (OptimTalk, Elvira)

7. Test the whole use case without a graphical user interface (from login to the
end of the call)

8. Call the voice application on a voice platform by phone

Application simulation starts in the design phase. Any FSM, static HTML pages and 
static VXML pages should be simulated with suitable tools to validate the design and 
the specification of the application. 

At any time it is important to interact with the VXML application and to simulate it - 
without a graphical interface. The tester has to listen to it because the real application 
environments are (mobile) phones or IP-telephony. Through a VXML-simulation you 
can also validate the FSM, HTML and media object specification because the VXML 
programmed code is based on these specifications. 







3 Summary 

In this work we present a specification for the rapid development of voice
application services. The specification is part of the documentation and the basis
for implementation. Due to the co-design of an HTML prototype and a VXML
application, the specification is easily understandable and is based on a robust
finite state machine navigation structure. Media objects with the corresponding
database tables can be added in any object oriented programming language. In
addition, the specification approach speeds up implementation.



3.1 Voice Application Development Process 

The application specification is based on use cases, FSMs, HTML and media objects 
which are developed in the phases within the waterfall oriented software development 
process. 

1. Story specification

2. Process specification 

3. Final requirement specification (use cases) 

4. Dialog Structure and navigation specification (FSM) 

5. Conversational data specification (static HTML, static VXML) 

6. Media object specification 

7. Integration of media objects and HTML 

8. HTML to VXML translation 

9. Test and simulation 

The development process is strictly divided into phases, each of which builds on
the previous ones and produces a specification. These specifications are used for
documentation purposes, too.



3.2 VoiceXML 2.0 System Architecture 

Our voice system uses the free VoiceXML interpreter OptimTalk, which was
developed by the Laboratory of Speech and Dialogue at the Masaryk University in
Brno, Czech Republic. OptimTalk supports VoiceXML 2.0 as well as DTMF input
and can be extended with a speech recognition component and with a SAPI 5.0
compliant synthetic voice (TTS) or pre-recorded audio.

Our voice approach for information systems uses DTMF as standard input for the
retrieval of information and speech recognition as a secondary input method.
DTMF and speech recognition are supported by VoiceXML 2.0. The VoiceXML 2.0
based voice application can be simulated without phone hardware, and it can even
run on a VoiceXML 2.0 telephony system of an ASP (application service provider).

Fig. 4. VoiceXML system architecture



3.3 Future Work 

We have developed a VoiceXML based speech interface for an E-Government
system, but there are still many open points which have to be analyzed.

• Automated development process: future work aims to automate this
  development process, especially the automatic generation of HTML pages on the
  basis of FSMs and the automatic translation from HTML pages to VXML pages
• Grammar Generation: dynamic and easy generation of grammars which will be
  used for speech recognition
• Application Service Provider: the VXML speech application should be tested on
  different telephony platforms of ASPs
• Speech Recognition: the VXML application should work with different speech
  recognition products
• Speech Synthesis: the VXML application should work with different speech
  synthesis products
• Languages: VXML application with support for different languages
• Integration GUI - VUI: cooperation of graphical user interfaces with voice user
  interfaces for the same services and applications (E-Mail, ...)
• Speech Technologies: integration of natural language technologies into a VXML
  system

Abstract. Dialogs in formal domains, such as mathematics, are characterized by
a mixture of telegraphic natural language text and embedded formal expressions.
Analysis methods for this kind of setting are rare and require empirical
justification due to a notorious lack of data, as opposed to the richness of
presentations found in genre-specific textbooks. In this paper, we focus on
dedicated interpretation techniques for major phenomena observed in a recently
collected corpus on tutorial dialogs in proving mathematical theorems. We
combine analysis techniques for mathematical formulas and for natural language
expressions, supported by knowledge about domain-relevant lexical semantics and
by representations relating vague lexical to precise domain terms.



1 Introduction 

Dialogs in formal domains, such as mathematics, are characterized by a mixture 
of telegraphic natural language text and embedded formal expressions. Acting 
adequately in these kinds of dialogs is specifically important for tutorial purposes 
since several application domains of tutorial systems are formal ones, including 
mathematics. Empirical findings show that flexible natural language dialog is 
needed to support active learning [14], and it has also been argued in favor of 
natural language interaction for intelligent tutoring systems [1]. 

To meet requirements of tutorial purposes, we aim at developing a tutoring 
system with flexible natural language dialog capabilities to support interactive 
mathematical problem solving. In order to address this task in an empirically 
adequate manner, we have carried out a Wizard-of-Oz (WOz) study on tutorial 
dialogs in proving mathematical theorems. In this paper, we report on inter- 
pretation techniques we have developed for major phenomena observed in this 
corpus. We combine analysis techniques for mathematical formulas and for natu- 
ral language expressions, supported by knowledge about domain-relevant lexical 
semantics and by representations relating vague lexical to precise domain terms. 

The outline of this paper is as follows. We first present the environment in
which this work is embedded, including a description of the WOz experiment.
Next, we describe the link we have established between linguistic and domain
knowledge sources. Then, we give details about interpretation methods for the
phenomena observed in the corpus, and we illustrate them with an example.
Finally, we discuss future developments.

F. Meziane and E. Metais (Eds.): NLDB 2004, LNCS 3136, pp. 26-38, 2004.
© Springer-Verlag Berlin Heidelberg 2004

Fig. 1. Dialog project scenario.



2 Our Project Environment 

Our investigations are part of the Dialog project¹ [5]. Its goal is (i) to
empirically investigate the use of flexible natural language dialog in tutoring
mathematics, and (ii) to develop an experimental prototype system gradually
embodying the empirical findings. The experimental system will engage in a
dialog in written natural language to help a student understand and construct
mathematical proofs. In contrast to most existing tutorial systems, we envision a
modular design, making use of the powerful proof system Ωmega [17]. This design
enables detailed reasoning about the student’s action and bears the potential of
elaborate system responses. The scenario for the system is illustrated in Fig. 1:

— Learning Environment: Students take an interactive course in the relevant
subfield of mathematics with the web-based system ActiveMath [13].

— Mathematical Proof Assistant (MPA): Checks the appropriateness of
user-specified inference steps with respect to the problem-solving goal, based on
Ωmega.

— Proof Manager (PM): In the course of the tutoring session the user may
explore alternative proofs. The PM builds and maintains a representation of
constructed proofs and communicates with the MPA to evaluate the
appropriateness of the user’s dialog contributions for the proof construction.

— Dialog Manager: We employ the Information-State (IS) Update approach to
dialog management developed in the Trindi project [18].

— Knowledge Resources: This includes pedagogical knowledge (teaching
strategies) and mathematical knowledge (in our MBase system [11]).

¹ The Dialog project is part of the Collaborative Research Center on
Resource-Adaptive Cognitive Processes (SFB 378) at University of the Saarland [15].

We have conducted a WOz experiment [6] with a simulated system [7] in order
to collect a corpus of tutorial dialogs in the naive set theory domain. 24 subjects
with varying educational backgrounds and prior mathematical knowledge ranging
from little to fair participated in the experiment. The experiment consisted of
three phases: (1) preparation and pre-test on paper, (2) tutoring session mediated
by a WOz tool, and (3) post-test and evaluation questionnaire, on paper again.
During the session, the subjects had to prove three theorems ($K$ and $P$ stand
for set complement and power set, respectively):
(i) $K((A \cup B) \cap (C \cup D)) = (K(A) \cap K(B)) \cup (K(C) \cap K(D))$;
(ii) $A \cap B \in P((A \cup C) \cap (B \cup C))$; and
(iii) if $A \subseteq K(B)$, then $B \subseteq K(A)$.
The interface enabled the subjects to type text and insert mathematical symbols
by clicking on buttons. The subjects were instructed to enter steps of a proof
rather than a complete proof at once, in order to encourage a dialog with the
system. The tutor-wizard’s task was to respond to the student’s utterances
following a given algorithm [8].

3 Phenomena Observed 

We have identified several kinds of phenomena, which bear some particularities 
of the genre and domain, and we have categorized them as follows (see Fig. 2): 

- Interleaving text with formula fragments: Formulas may not only be
introduced by natural language statements (1), they may also be enhanced by
natural language function words and connectives (2), (3), or natural language
and formal statements may be tightly connected (4). The latter example poses
specific analysis problems, since only a part (here: variable $x$) of a
mathematical expression (here: $x \in A$) lies within the scope of a natural
language operator adjacent to it (here: negation).

- Informal relations: Domain relations and concepts may be described
imprecisely or ambiguously using informal natural language expressions. For
example, “to be in” can be interpreted as “element”, which is correct in
(5), or as “subset”, which is correct in (6); and “both sets together” in (7)
as “union” or “intersection”. Moreover, common descriptions applicable to
collections need to be interpreted in view of the application to their
mathematical counterparts, the sets: the expressions “completely outside” (8)
and “completely different” (9) refer to relations on elements of the sets
compared.

- Incompleteness: A challenge for the natural language analysis lies in the
large number of unexpected synonyms, some of which have a metonymic
flavor. For example, “left side” (12) refers to a part of an equation, which is
not mentioned explicitly. Moreover, the expression “inner parenthesis” (11)
requires a metonymic interpretation, referring to the expression enclosed by
that pair of parentheses. Similarly, the term “complement” (10) does not refer
to the operator per se, but to an expression identifiable by this operator, that
is, where complement is the top-level operator in the expression referred to.




Interleaving mode

(1) Nach DeMorgan-Regel-2 ist $K((A \cup B) \cap (C \cup D)) = (K(A \cup B) \cup K(C \cup D))$
    According to DeMorgan-Rule-2 $K((A \cup B) \cap (C \cup D)) = (K(A \cup B) \cup K(C \cup D))$ holds

(2) A auch $\subseteq K(B)$
    A also $\subseteq K(B)$

(3) $A \cap B$ ist $\in$ von $C \cup (A \cap B)$, da ja $A \cap B = \emptyset$
    $A \cap B$ is $\in$ of $C \cup (A \cap B)$, because $A \cap B = \emptyset$

(4) B enthält kein $x \in A$
    B contains no $x \in A$

Informal relations

(5) Da $A \subseteq K(B)$ gilt, sind alle x, die in A sind, nicht in B
    As $A \subseteq K(B)$ applies, all x, that are in A, are not in B

(6) $(A \cup B)$ muß in $P((A \cup C) \cap (B \cup C))$ sein, da $(A \cap B) \in (A \cap B) \cup C$
    $(A \cup B)$ must be in $P((A \cup C) \cap (B \cup C))$, since $(A \cap B) \in (A \cap B) \cup C$

(7) Wenn A Teilmenge von C und B Teilmenge von C dann müssen beide Mengen
    zusammen ebenfalls eine Teilmenge von C sein.
    If A is a subset of C and B a subset of C, then both sets together must
    also be a subset of C.

(8) B muß vollständig außerhalb von A liegen, also im Komplement von A
    B has to be entirely outside of A, so in the complement of A

(9) Dann sind A und B vollkommen verschieden, haben keine gemeinsamen Elemente
    Then A and B are completely different, have no common elements

Incompleteness

(10) $K((A \cup B) \cap (C \cup D)) = K(A \cup B) \cup K(C \cup D)$ de Morgan Regel 2 auf
     beide Komplemente angewendet
     $K((A \cup B) \cap (C \cup D)) = K(A \cup B) \cup K(C \cup D)$ de Morgan rule 2 applied
     to both complements

(11) Distributivität von Vereinigung über Durchschnitt: $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$
     Hier dann also: $C \cup (A \cap B) = (A \cup C) \cap (B \cup C)$ Dies für die innere
     Klammer
     Distributivity of union over intersection: $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$
     Here: $C \cup (A \cap B) = (A \cup C) \cap (B \cup C)$ This for the inner parenthesis

(12) $A \cap B$ auf der linken Seite ist $\in$ von $C \cup (A \cap B)$, was ja nur durch C erweitert
     wird
     $A \cap B$ on the left side is $\in$ of $C \cup (A \cap B)$, which is extended only by C

Operators

(13) Wenn alle A in K(B) enthalten sind und dies auch umgekehrt gilt, muß es
     sich um zwei identische Mengen handeln
     If all A are contained in K(B) and this also holds vice-versa, these must
     be identical sets

(14) Mengenvereinigung ist symmetrisch
     Set union is symmetrical

Fig. 2. Examples of dialog utterances (not necessarily correct in a mathematical sense).
The predicates P and K stand for power set and complement, respectively.








H. Horacek and M. Wolska 



- Operators: Semantically complex operators require a domain-specific
interpretation, such as “vice-versa” in (13). Occasionally, natural language
referential access to mathematical concepts deviates from the proper
mathematical conception. For example, the truth of some axiom, when instantiated
for an operator, might be expressed as a property of that operator in natural
language, such as “symmetry” as a property of “set union” (14). In the
domain of mathematics, this situation is conceived as an axiom instantiation.

Fig. 3. A fragment of the intermediate representation of objects



4 Intermediate Knowledge Representation 

In order to adequately process utterances such as the ones discussed in the
previous section, natural language analysis methods require access to domain
knowledge. However, this poses serious problems, due to the fundamental
representation discrepancies between knowledge bases of deduction systems, such
as our system ΩMEGA, and linguistically motivated knowledge bases, as elaborated
in [10]. The contribution of the intermediate knowledge representation explained
in this section is to mediate between these two complementary views.

In brief, ΩMEGA’s knowledge base is organized as an inheritance network,
and representation is simply concentrated on the mathematical concepts per se.
Their semantics is expressed in terms of lambda-calculus expressions which
constitute precise and complete logical definitions required for proving
purposes. Inheritance is merely used to percolate specifications efficiently,
to avoid redundancy and to ease maintenance, but hierarchical structuring is not
even imposed. Meeting communication purposes, in contrast, does not require
access to complete logical definitions, but does require several pieces of
information that go beyond what is represented in ΩMEGA’s knowledge base. This
includes:

- Hierarchically organized specialization of objects, together with their
properties, and object categories for their fillers, enabling, e.g., type
checking.

- The representation of vague and general terms which need to be interpreted
in domain-specific terms in the tutorial context.

- Modeling of typographic features representing mathematical objects
“physically”, including markers and orderings, such as argument positions.

Fig. 4. A fragment of the intermediate representation of relations



In order to meet these requirements, we have built a representation that
constitutes an enhanced mirror of the domain representations in ΩMEGA. It serves
as an intermediate representation between the domain and linguistic models.
The domain objects and relations are reorganized in a specialization hierarchy
in a KL-ONE-like style, and prominent aspects of their semantics are expressed
as properties of these items, with constraints on the categories of their fillers.
For example, the operator that appears in the definition of the symmetry axiom
is re-expressed under the property central operator, to make it accessible to
natural language references (see the lower right part of Fig. 3).

In the representation fragments in Fig. 3 and 4, objects and relations are 
referred to by names in capital letters, and their properties by names in small 
letters. Properties are inherited in an orthogonal monotonic fashion. Moreover, a 
specialization of a domain object may introduce further properties, indicated by 
a leading ’+’ in the property name, or it may specialize properties introduced by 
more general objects, which is indicated by the term ’spec’ preceding the more 
specific property name. In addition, value restrictions on the property fillers may 
be specified, which is indicated by the term ’restr’ preceding the filler name, and 
an interval enclosed in parentheses which expresses number restrictions. 
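
The notation just described can be mimicked with a small data structure. The following Python sketch is only an illustration under our own assumptions: the class names, the example concepts, and the choice to let a specialized property take the place of the general one are hypothetical and do not reproduce the entries of Fig. 3.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Property:
    name: str
    filler: str                         # 'restr': category the filler must belong to
    count: tuple = (1, 1)               # number restriction as (min, max); None = unbounded
    specializes: Optional[str] = None   # 'spec': the more general property it refines


@dataclass
class Concept:
    name: str
    parent: Optional["Concept"] = None
    own: list = field(default_factory=list)   # properties stated at this level

    def properties(self):
        """All properties visible at this concept, keyed by property name."""
        visible = dict(self.parent.properties()) if self.parent else {}
        for prop in self.own:
            if prop.specializes is not None:
                if prop.specializes not in visible:
                    raise ValueError(f"{prop.specializes} is not inherited by {self.name}")
                del visible[prop.specializes]   # assumption: the refinement takes its place
            visible[prop.name] = prop
        return visible


# Illustrative hierarchy only; these are not the entries of Fig. 3.
STRUCTURED = Concept("STRUCTURED OBJECT",
                     own=[Property("component", "OBJECT", (1, None))])
EQUATION = Concept("EQUATION", parent=STRUCTURED, own=[
    Property("side", "TERM", (2, 2), specializes="component"),   # 'spec'
    Property("central operator", "OPERATOR"),                    # '+'
])
```

Calling `EQUATION.properties()` yields the refined property `side` and the newly introduced `central operator`, while `STRUCTURED.properties()` still shows only `component`.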

These re-representations are extended in several ways. Some of the objects
may be associated with procedural tests about typographical properties, which
are accessible to the analysis module (not depicted in Fig. 3 and 4). They
express, for instance, what makes a “parenthesis” an “inner parenthesis”, or
what constitutes the “embedding” of a formula. Another important extension
comprises modeling of typographic features representing mathematical objects
“physically”. This includes markers such as parentheses, as well as orderings,
such as the sides of an equation. They are modeled as properties of structured
objects, in addition to the structural components which make up the semantics
of the logical system. Moreover, typographic properties may be expressed as
parts of specializations, such as bracket-enclosed formulas as a specific kind of
a (sub-)formula (Fig. 3). A further extension concerns vague and general terms,
such as “containment” and “companion”, represented as semantic roles. They
are conceived as generalizations of mathematical relations, in terms of the
semantics associated with these roles (Fig. 4). For example, a “containment”
holds between two items if the first one belongs to the second, or all its
components separately do, which applies to “subset” and “element-of” relations.
Similarly, “companion” comprises “union” and “intersection” operations.
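
As a rough illustration of how such vague terms can generalize over precise relations, consider the following Python sketch. The table and function names are hypothetical; only the grouping of “subset” and “element-of” under “containment”, and of “union” and “intersection” under “companion”, is taken from the description above.

```python
# Hypothetical encoding of the vague relations; names are illustrative.
SPECIALIZATIONS = {
    "CONTAINMENT": ["SUBSET", "ELEMENT"],
    "COMPANION": ["UNION", "INTERSECTION"],
}


def domain_readings(relation):
    """Return the domain-specific candidates for a (possibly vague) relation."""
    return SPECIALIZATIONS.get(relation, [relation])


def containment_holds(item, container):
    """True if item belongs to container, or all of item's components do."""
    if item in container:                            # element-of reading
        return True
    try:
        return all(x in container for x in item)     # subset reading
    except TypeError:                                # item has no components
        return False
```

For instance, `containment_holds(1, {1, 2})` succeeds under the element reading and `containment_holds({1, 2}, {1, 2, 3})` under the subset reading, which is exactly the ambiguity that the analysis has to pass on.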

5 Analysis Techniques 

In this section, we present the analysis methodology and show interactions with 
the knowledge representation presented in Sect. 4. The analysis proceeds in 3 
stages: (i) Mathematical expressions are identified, analyzed, categorized, and 
substituted with default lexicon entries encoded in the grammar (Sect. 5.1); 
(ii) Next, the input is syntactically parsed, and a representation of its linguistic 
meaning is constructed compositionally along with the parse (Sect. 5.2); (iii) The 
linguistic meaning representation is subsequently embedded within discourse 
context and interpreted by consulting the semantic lexicon (Sect. 5.3) and the 
ontology (Sect. 4). 

5.1 Analyzing Formulas 

The task of the mathematical expression parser is to identify mathematical
content within sentences. The identified mathematical expressions are
subsequently verified as to syntactic validity, and categorized as being of type
CONSTANT, TERM, FORMULA, 0_FORMULA (formula missing left argument), etc.

Identification of mathematical expressions within the word-tokenized text is
based on simple indicators: single-character tokens, Unicode characters for
mathematical symbols, and new-line characters. The tagger converts the infix
notation into an expression tree from which the following information is
available: surface substructure (e.g., “left side” of an expression, list of
sub-expressions, list of bracketed sub-expressions), and expression type (based
on the top-level operator).

For example, the expression $K((A \cup B) \cap (C \cup D)) = K(A \cup B) \cup K(C \cup D)$
in utterance (10) in Fig. 2 is of type FORMULA (given the expression’s top node
operator, $=$), its “left side” is the expression $K((A \cup B) \cap (C \cup D))$, and the
list of bracketed sub-expressions includes: $A \cup B$, $C \cup D$, $(A \cup B) \cap (C \cup D)$, etc.
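
The following Python sketch shows one way such an expression tree could be built and queried. It is a minimal illustration rather than the tagger used in the project: the grammar (relations over terms, with union and intersection at equal precedence), the treatment of function arguments as bracketed sub-expressions, and all names are our assumptions.

```python
import re

RELATIONS = {"=", "∈", "⊆"}
OPERATORS = {"∪", "∩"}
FUNCTIONS = {"K", "P"}          # complement and power set


def tokenize(text):
    return re.findall(r"[A-Za-z]|[()=∈⊆∪∩]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0
        self.bracketed = []      # sub-expressions that were enclosed in parentheses

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, found {tok!r}")
        self.pos += 1
        return tok

    def formula(self):
        if self.peek() in RELATIONS:             # e.g. "∈ A": left argument missing
            return (self.take(), None, self.term())
        left = self.term()
        if self.peek() in RELATIONS:
            return (self.take(), left, self.term())
        return left

    def term(self):
        node = self.factor()
        while self.peek() in OPERATORS:
            node = (self.take(), node, self.factor())
        return node

    def factor(self):
        tok = self.take()
        if tok in FUNCTIONS and self.peek() == "(":
            self.take("(")
            arg = self.term()
            self.take(")")
            self.bracketed.append(arg)
            return (tok, arg)
        if tok == "(":
            inner = self.term()
            self.take(")")
            self.bracketed.append(inner)
            return inner
        if tok.isalpha():
            return tok
        raise SyntaxError(f"unexpected token {tok!r}")


def show(node):
    """Render a tree back into infix notation."""
    if isinstance(node, str):
        return node
    if len(node) == 2:                            # function application
        return f"{node[0]}({show(node[1])})"
    op, left, right = node

    def wrap(child):
        text = show(child)
        return f"({text})" if isinstance(child, tuple) and len(child) == 3 else text

    prefix = "" if left is None else wrap(left) + " "
    return f"{prefix}{op} {wrap(right)}"


def analyse(text):
    parser = Parser(tokenize(text))
    tree = parser.formula()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    if isinstance(tree, str):
        kind = "CONSTANT"
    elif tree[0] in RELATIONS:
        kind = "0_FORMULA" if tree[1] is None else "FORMULA"
    else:
        kind = "TERM"
    return {"type": kind,
            "left side": show(tree[1]) if kind == "FORMULA" else None,
            "bracketed": [show(b) for b in parser.bracketed]}
```

Applied to the expression from utterance (10), `analyse` reports the type FORMULA, the left side `K((A ∪ B) ∩ (C ∪ D))`, and a list of bracketed sub-expressions that starts with `A ∪ B`, `C ∪ D`, and `(A ∪ B) ∩ (C ∪ D)`.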

5.2 Analyzing Natural Language Expressions 

The task of the natural language analysis module is to produce a linguistic
meaning representation of sentences and fragments that are syntactically
well-formed. The sentence meaning obtained at this stage of processing is
independent of the domain-specific meaning assigned at the next stage (Sect. 5.3).

By linguistic meaning, we understand the deep semantics in the sense of the
Prague School notion of sentence meaning as employed in the Functional
Generative Description (FGD) [16,12]. In the Praguian FGD approach, the central
frame unit of a sentence/clause is the head verb, which specifies the roles of
its dependents (or participants). A further distinction is drawn between inner
participants, such as Actor, Patient, Addressee, and adverbial free
modifications, such as Location, Means, Direction. To derive our set of semantic
relations we generalize and simplify the collection of Praguian tectogrammatical
relations in [9]. The reason for this simplification is, among others, to
distinguish which of the roles have to be understood metaphorically given our
specific sub-language domain (e.g., Formula as an Actor). The most commonly
occurring roles in our context are those of Cause, Condition, and
Result-Conclusion, for example Cause in utterance (5) and Condition in utterance
(7) in Fig. 2. Others include Location, Property, and GeneralRelation.

The analysis is performed using openCCG, an open source multi-modal
combinatory categorial grammar (MMCCG) parser². MMCCG is a lexicalist grammar
formalism in which application of combinatory rules is controlled through
context-sensitive specification of modes on slashes [2,4]. The linguistic
meaning (LM), built in parallel with the syntax, is represented using Hybrid
Logic Dependency Semantics (HLDS), a hybrid logic representation which allows a
compositional, unification-based construction of HLDS terms with CCG [3].
Dependency relations between heads and dependents are explicitly encoded in the
lexicon as modal relations.

For example, in utterance (1) in Fig. 2, “ist” represents the meaning
hold, and in this frame takes dependents in the tectogrammatical relations
Norm-Criterion and Patient. The identified mathematical expression is
categorized as being of type FORMULA (references to the structural sub-parts of
the entity FORMULA are available through the information from the mathematical
expression tagger). The following hybrid logic formula represents the linguistic
meaning of this utterance (dMR2 denotes the lexical entry for deMorgan-Rule-2):

$@_{h1}(\text{holds} \wedge \langle\text{NORM}\rangle(d1 \wedge \text{dMR2}) \wedge \langle\text{PAT}\rangle(f1 \wedge \text{FORMULA}))$

where h1 is the state where the proposition holds is true, and the nominals d1
and f1 represent the dependents of kinds Norm and Patient, respectively, of holds.
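
To make the shape of such a term concrete, the following Python sketch builds the same dependency structure as a plain data object and prints it in the notation used above. It does not use the openCCG API; the `Node` class and its methods are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    nominal: str                                       # e.g. "h1"
    proposition: str                                   # e.g. "holds"
    dependents: list = field(default_factory=list)     # (relation, Node) pairs

    def body(self):
        parts = [self.proposition]
        parts += [f"<{rel}>({dep.nominal} ∧ {dep.body()})"
                  for rel, dep in self.dependents]
        return " ∧ ".join(parts)

    def term(self):
        return f"@{self.nominal}({self.body()})"


# Linguistic meaning of utterance (1), following the formula above.
lm = Node("h1", "holds", [
    ("NORM", Node("d1", "dMR2")),
    ("PAT", Node("f1", "FORMULA")),
])
print(lm.term())
```

The head `holds` carries its two dependents as labelled relations, which mirrors the way the lexicon encodes dependency relations as modal relations.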

Default lexical entries (e.g. FORMULA; cf. Sect. 5.1) are encoded in the
sentence parser grammar for the mathematical expression categories. At the
formula parsing stage, they are substituted within the sentence in place of the
symbolic expressions. The sentence parser processes the sentence without the
symbolic content. The syntactic categories encoded in the grammar for the
lexical entry FORMULA are S, NP, and N.
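
The substitution step can be sketched as follows. This Python fragment is a simplified illustration under our own assumptions: it treats every single-character token as mathematical content (one of the indicators mentioned in Sect. 5.1), expects space-separated tokens, and uses hypothetical function names; a real tagger would need to handle punctuation and one-letter words.

```python
MATH_SYMBOLS = set("∪∩∈⊆=()")
RELATION_SYMBOLS = set("∈⊆=")


def is_math_token(token):
    # Indicators: single-character tokens and mathematical symbols.
    return len(token) == 1 or set(token) <= MATH_SYMBOLS


def categorize(run):
    if len(run) == 1 and run[0].isalpha():
        return "CONSTANT"
    if run[0] in RELATION_SYMBOLS:
        return "0_FORMULA"                   # formula missing its left argument
    if any(tok in RELATION_SYMBOLS for tok in run):
        return "FORMULA"
    return "TERM"


def substitute(sentence):
    """Replace each maximal run of math tokens by its default lexical entry."""
    out, run, found = [], [], []
    for token in sentence.split() + [None]:  # None flushes the final run
        if token is not None and is_math_token(token):
            run.append(token)
            continue
        if run:
            category = categorize(run)
            out.append(category)
            found.append((category, " ".join(run)))
            run = []
        if token is not None:
            out.append(token)
    return " ".join(out), found
```

For utterance (4), `substitute("B enthält kein x ∈ A")` leaves only natural language words and category placeholders for the sentence parser, while the removed symbolic content is kept alongside for later interpretation.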

5.3 Domain Interpretation 

The semantic lexicon defines linguistic realizations of conceptual predicates and
provides a link to their domain interpretations through the ontology (cf. Sect. 4).
Lexical semantics in combination with the knowledge encoded in the ontology
allows us to obtain domain-specific interpretations of general descriptions.
Moreover, productive rules for treatment of metonymic expressions are encoded
through instantiation of type-compatible counterparts. If more than one
interpretation is plausible, no disambiguation is performed. Alternative
interpretations are passed on to the Proof Manager (cf. Fig. 1) for evaluation by
the theorem prover. Below we explain some of the entries the lexicon encodes
(cf. Fig. 5):

² http://openccg.sourceforge.net



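(15) $\text{contain}(\text{ACT}_{\text{type:FORMULA}}, \text{PAT}_{\text{type:FORMULA}})$
         $= (\text{SUBFORMULA}_{\text{PAT}}, \text{embedding}_{\text{ACT}})$
     $\text{contain}(\text{ACT}_{\text{type:OBJECT}}, \text{PAT}_{\text{type:OBJECT}})$
         $= \text{CONTAINMENT}(\text{container}_{\text{ACT}}, \text{containee}_{\text{PAT}})$

(16) $\text{in}(\text{ACT}_{\text{type:OBJECT}}, \text{LOC}_{\text{type:OBJECT}})$
         $= \text{CONTAINMENT}(\text{container}_{\text{LOC}}, \text{containee}_{\text{ACT}})$

(17) $\text{outside}(\text{ACT}_{\text{type:OBJECT}}, \text{LOC}_{\text{type:OBJECT}})$
         $= \text{not}(\text{in}(\text{ACT}_{\text{type:OBJECT}}, \text{LOC}_{\text{type:OBJECT}}))$

(18) $\text{common}(\text{Property}, \text{ACT}_{\text{plural}}(A{:}\text{SET}, B{:}\text{SET}))$
         $= \text{Property}(p_1, A) \wedge \text{Property}(p_1, B)$

(19) $\text{common}(\text{element}, \text{ACT}_{\text{plural}}(A{:}\text{SET}, B{:}\text{SET}))$
         $= \text{element}(p_1, A) \wedge \text{element}(p_1, B)$

(20) $\text{different}(\text{ACT}_{\text{plural}}(A{:}\text{SET}, B{:}\text{SET})) = A \neq B$
     $\text{different}(\text{ACT}_{\text{plural}}(A{:}\text{SET}, B{:}\text{SET}))$
         $= (e_1 \text{ element } A \wedge e_2 \text{ element } B \Rightarrow e_1 \neq e_2)$
     $\text{different}(\text{ACT}_{\text{plural}}(A{:}\text{STRUCTURED OBJECT}, B{:}\text{STRUCTURED OBJECT}))$
         $= (\text{Property}_1(p_1, A) \wedge \text{Property}_2(p_2, B) \wedge \text{Property}_1 = \text{Property}_2 \Rightarrow p_1 \neq p_2)$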



Fig. 5. An excerpt of the semantic lexicon 



- Containment: The Containment relation, as indicated in Fig. 4, specializes
into the domain relations of (strict) SUBSET and ELEMENT. Linguistically, it
can be realized, among others, with the verb “enthalten” (“contain”). The
tectogrammatical frame of “enthalten” involves the roles of Actor (ACT)
and Patient (PAT). Translation rules (15) serve to interpret this predicate.

- Location: The Location relation, realized linguistically by the prepositional
phrase “in...(sein)” (“be in”), involves the tectogrammatical relations of
Location (LOC) and the Actor of the predicate “sein”. We consider Location in
our domain as synonymous with Containment. Translation rule (16) serves
to interpret the tectogrammatical frame of one of the instantiations of the
Location relation. Another realization of the Location relation, dual to the
above, occurs with the adverbial phrase “außerhalb von ...(liegen)” (“lie
outside of”) and is defined as negation of Containment (17).

- Common property: We define a general notion of “common property” as in
(18). The Property here is a meta-object which can be instantiated with any
relational predicate, for example, realized by a Patient relation as in “(A und
B)<ACT> haben (gemeinsame Elemente)<PAT>” (“A and B have common
elements”). In this case the definition (18) is instantiated as in (19).

- Difference: The Difference relation, realized linguistically by the predicates
“verschieden (sein)” (“be different”; for COLLECTION or STRUCTURED OBJECTS)
and “disjunkt (sein)” (“be disjoint”; for objects of type COLLECTION),
involves a plural Actor (e.g. coordinated noun phrases) and a HasProperty
tectogrammatical relation. Depending on the domain type of the entity in
the Actor relation, the translations are as in (20).

- Mereological relations: Here we encode part-of relations between domain
objects. These concern both physical surface and ontological properties of
objects. Commonly occurring part-of relations in our domain are:

$\text{hasComponent}(\text{STRUCTURED OBJECT}_{\text{term, formula}}, \text{STRUCTURED OBJECT}_{\text{subterm, subformula}})$

$\text{hasComponent}(\text{STRUCTURED OBJECT}_{\text{term, formula}}, \text{STRUCTURED OBJECT}_{\text{enclosed term, enclosed formula}})$

$\text{hasComponent}(\text{STRUCTURED OBJECT}_{\text{term, formula}}, \text{STRUCTURED OBJECT}_{\text{term component, formula component}})$

Moreover, we have from the ontology (cf. Fig. 3):

$\text{Property}(\text{STRUCTURED OBJECT}_{\text{term, formula}}, \text{component}_{\text{term side, formula side}})$

Using these definitions and polysemy rules such as polysemous(Object,
Property), we can obtain interpretations of utterances such as “Dann gilt für
die linke Seite, ... ” (“Then for the left side it holds that ... ”), where
the predicate “gilt” normally takes two arguments of types STRUCTURED
OBJECT (term, formula), rather than an argument of type Property.
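
The effect of rules (16) to (20) can be made tangible by evaluating them on concrete finite sets. The following Python sketch is our own illustration, not part of the system: it fixes the element reading of Containment for rule (16), reads the variable p1 in rules (18) and (19) existentially, and implements only the element-wise reading of rule (20); all function names are hypothetical.

```python
def element(p, s):
    return p in s


def located_in(act, loc):
    """Rule (16), element reading only: ACT is contained in LOC."""
    return element(act, loc)


def outside(act, loc):
    """Rule (17): negation of the Location relation."""
    return not located_in(act, loc)


def common(prop, a, b):
    """Rules (18)/(19): some p1 satisfies the property for both A and B."""
    return any(prop(p, a) and prop(p, b) for p in a | b)


def different(a, b):
    """Element-wise reading of rule (20): no element of A equals one of B."""
    return all(e1 != e2 for e1 in a for e2 in b)


A, B, C = {1, 2}, {3, 4}, {2, 3}
print(common(element, A, B), different(A, B))   # no shared element
print(common(element, A, C), different(A, C))   # the element 2 is shared
```

Under these assumptions, two sets that are “completely different” in the sense of utterance (9) are exactly those that “have no common elements”.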

6 Example Analysis 

Here, we present an example analysis of the utterance “B contains no $x \in A$” ((4)
in Fig. 2) to illustrate the mechanics of the approach.

In the given utterance, the scope of negation is over a part of the formula 
following it, rather than the whole formula. The predicate contain represents 
the semantic relation of CONTAINMENT and is ambiguous between the domain 
readings of (strict) subset, element, and SUBFORMULA. 

The formula tagger first identifies the formula <$x \in A$> and substitutes it with
the generic entry FORMULA represented in the lexicon of the grammar. If there
is no prior discourse entity for “B” to verify its type, the type is ambiguous
between CONSTANT, TERM, and FORMULA. The sentence is assigned four alternative
readings: “CONST contains no FORMULA”, “TERM contains no FORMULA”,
“FORMULA contains no FORMULA”, and “CONST contains no CONST 0_FORMULA”. The
last reading is obtained using shallow rules for modifiers (identified immediately
before the formula) that take into account their possible interaction with
mathematical expressions. Here, given the preceding quantifier, the expression
<$x \in A$> has been split into its surface parts, <[$x$][$\in A$]>; [$x$] has been
substituted with a lexical entry CONST, and [$\in A$] with an entry for a formula
missing its left argument, 0_FORMULA³ (cf. Sect. 5.1). The first and the second
readings are rejected because of sortal incompatibility.

³ There are other ways of constituent partitioning of the formula to separate the
operator and its arguments (they are: <[$x$][$\in$][$A$]> and <[$x \in$][$A$]>). Each of the

The resulting linguistic meanings and readings of the sentence are (i) for the
reading “FORMULA contains no FORMULA”:

$s : (@_{k1}(\text{kein} \wedge \langle\text{RESTR}\rangle f2 \wedge \langle\text{BODY}\rangle(e1 \wedge \text{enthalten} \wedge \langle\text{ACT}\rangle(f1 \wedge \text{FORMULA}) \wedge \langle\text{PAT}\rangle f2)) \wedge @_{f2}(\text{FORMULA}))$

’formula B contains no (sub-)formula $x \in A$’

and (ii) for the reading “CONST contains no CONST 0_FORMULA”:

$s : (@_{k1}(\text{kein} \wedge \langle\text{RESTR}\rangle x1 \wedge \langle\text{BODY}\rangle(e1 \wedge \text{enthalten} \wedge \langle\text{ACT}\rangle(c1 \wedge \text{CONST}) \wedge \langle\text{PAT}\rangle x1)) \wedge @_{x1}(\text{CONST} \wedge \langle\text{HASPROP}\rangle(x2 \wedge \text{0\_FORMULA})))$

’constant B contains no constant x such that x is an element of A’

The semantic lexicon is consulted to translate the readings into their domain
interpretation. The relevant entries are (15) in Fig. 5. Four interpretations of
the sentence are obtained using the LMs, the semantic lexicon, and the ontology:
(i) for the reading “FORMULA contains no FORMULA”:

(1) ’it is not the case that <PAT>, formula $x \in A$, is a subformula of <ACT>, formula B’

and (ii) for the reading “CONST contains no CONST 0_FORMULA”:

(2a) ’it is not the case that <PAT>, the constant x, $\subseteq$ <ACT>, B, and $x \in A$’,

(2b) ’it is not the case that <PAT>, the constant x, $\in$ <ACT>, B, and $x \in A$’,

(2c) ’it is not the case that <PAT>, the constant x, $\subset$ <ACT>, B, and $x \in A$’.

The first interpretation, (1), is verified in the discourse context with information
on structural parts of the discourse entity “B” of type FORMULA, while the other
three, (2a-c), are translated into messages to the Proof Manager and passed on
for evaluation in the proof context.
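
The filtering and expansion of readings in this example can be summarized in a few lines of Python. The sketch below is an illustration under stated assumptions: the sort table (in particular, treating TERM as an OBJECT), the dictionary layout, and the function name are hypothetical, while the two entries for “contain” follow rules (15) and the three specializations follow the readings (2a-c).

```python
# Assumed sort hierarchy: constants and terms count as OBJECTs.
SORT = {"CONST": "OBJECT", "TERM": "OBJECT", "FORMULA": "FORMULA"}

# Rules (15): interpretation of "contain" by argument sorts.
CONTAIN = {
    ("FORMULA", "FORMULA"): "SUBFORMULA",
    ("OBJECT", "OBJECT"): "CONTAINMENT",
}

# Domain relations that the vague Containment relation specializes into.
SPECIALIZATIONS = {"CONTAINMENT": ["SUBSET", "ELEMENT", "STRICT_SUBSET"]}


def interpret_contain(act_type, pat_type):
    """Return all domain interpretations, or [] on sortal incompatibility."""
    relation = CONTAIN.get((SORT[act_type], SORT[pat_type]))
    if relation is None:
        return []
    return SPECIALIZATIONS.get(relation, [relation])


# (ACT type, PAT head type) for the four readings of utterance (4).
readings = [("CONST", "FORMULA"), ("TERM", "FORMULA"),
            ("FORMULA", "FORMULA"), ("CONST", "CONST")]
for act, pat in readings:
    print(act, pat, interpret_contain(act, pat))
```

The first two readings yield no interpretation, the third yields the subformula interpretation, and the last yields three alternatives, which gives the four interpretations listed above. No disambiguation is attempted; the alternatives are simply collected for later evaluation.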

7 Conclusions and Future Research 

In this paper, we have presented methods for analyzing telegraphic natural lan- 
guage text with embedded formal expressions. We are able to deal with major 
phenomena observed in a corpus study on tutorial dialogs about proving math- 
ematical theorems, as carried out within the Dialog project. Our techniques 
are based on an interplay of a formula interpreter and a linguistic parser which 
consult an enhanced domain knowledge base and a semantic lexicon. 

Given the considerable demand on interpretation capabilities, as imposed by
tutorial system contexts, it is hardly surprising that we are still at the beginning
of our investigations. The most obvious extension for meeting tutorial purposes
is the ability to deal with errors in a cooperative manner. This requires
the two analysis modules to interact in an even more interwoven way. Another
extension concerns the domain-adequate interpretation of semantically complex
operators such as ’vice-versa’ as in (13) in Fig. 2. ’Vice-versa’ is ambiguous
here in that it may operate on immediate dependent relations or on the embedded
relations. The utterance “and this also holds vice-versa” in (13) may be
interpreted as “alle K(B) in A enthalten sind” (“all K(B) are contained in A”) or
“alle B in K(A) enthalten sind” (“all B are contained in K(A)”), where the
immediate dependent of the head enthalten and all its dependents in the Location
relation are involved (K(B)), or only the dependent embedded under
GeneralRelation (complement, K). Similarly, “head switching” operators require
a more complex definition. For example, the ontology defines the theorem
symmetry (or similarly distributivity, commutativity) as involving a functional
operator and specifying a structural result. On the other hand, linguistically,
“symmetric” is used predicatively (symmetry is predicated of a relation or
function).

partitions obtains its appropriate type corresponding to a lexical entry available in
the grammar (e.g., the [$x \in$] chunk is of type FORMULA_0 for a formula missing its
right argument). Not all the readings, however, compose to form a syntactically and
semantically valid parse of the given sentence.

A further yet to be completed extension concerns modeling of actions of 
varying granularity that impose changes on the proof status. In the logical sys- 
tem, this is merely expressed as various perspectives of causality, based on the 
underlying proof calculus. Dealing with all these issues adequately requires the 
development of more elaborate knowledge sources, as well as informed best-first 
search strategies to master the huge search space that results from the tolerance 
of various kinds of errors.



Abstract. In this paper, a method of event ordering based on temporal
information resolution is presented. This method consists of two main
steps: first, the recognition and resolution of the temporal expressions
that can be transformed into a date, so that these dates establish an
order between the events that contain them; second, the detection of
temporal signals, for example “after”, that cannot be transformed into a
concrete date but relate two events chronologically. This event ordering
method can be applied to Natural Language Processing systems such as
Summarization, Question Answering, etc.



1 Introduction 

Nowadays, the information society needs a set of tools for handling the
increasing amount of digital information stored on the Internet. Document
database applications help us to manage this information. However, building a
document database requires the application of automatic processes in order to
extract relevant information from texts.

One of these automatic processes is event ordering by means of temporal
information. Usually, a user needs to obtain all the information related to a
specific event. To do this, the user must know the relationships between this
event and other events, and their chronological information. The automatic
identification of temporal expressions associated with events and of temporal
signals that relate events, together with their further treatment, allows the
building of their chronographic diagram. Temporal expression treatment is based
on establishing relationships between concrete dates or time expressions (25th
December 2002) and relative dates or time expressions (the day before).
Temporal signal treatment is based on determining the temporal relationship
between the two events that the signal relates. Using all this information, the
application of event-ordering techniques allows us to obtain the desired event
ordering.

This paper is structured as follows: first, section 2 gives a short
introduction to the main contributions of previous work. Then, section 3
describes the Event Ordering system and the different units of the system:
the Temporal Information Detection unit, the Temporal Expression Coreference
Resolution unit, the Ordering Keys unit and the Event Ordering unit. Following
this, there is a graphical example of how the event ordering method works.
In section 4, the application of this event ordering method in one NLP task
(Question Answering) is explained. Finally, the evaluation of the TERSEO system
in Spanish and some conclusions are presented.

* This paper has been supported by the Spanish government, projects
FIT-150500-2002-244, FIT-150500-2002-416, TIC-2003-07158-C04-01 and
TIC2000-0664-C02-02

2 Previous Work 

At the moment there are different kinds of systems that cope with the event
ordering issue. Some of the current systems are knowledge-based, like Filatova
and Hovy [3], who describe a procedure for arranging into a time-line the
contents of news stories describing the development of some situation. The
system is divided into two main parts: first, breaking sentences into
event-clauses and, second, resolving both explicit and implicit temporal
references. Evaluations show a performance of 52%, compared to humans. The
system of Schilder and Habel [10] is knowledge-based as well. This system
detects temporal expressions and events and establishes temporal relationships
between the events. Works like Mani and Wilson [5] develop an event-ordering
component that aligns events on a calendric line, using tagged TIME expressions.
By contrast, some other important systems are based on Machine Learning and
focused on Event Ordering, for instance, Katz and Arosio [4] and Setzer and
Gaizauskas [11]. The latter is focused on annotating Event-Event Temporal
Relations in text, using a time-event graph which is more complete but costly
and error-prone.

Although systems based on Machine Learning obtain high-precision results
when applied to concrete domains, these results are lower when these kinds of
systems are applied to other domains. Besides, they need large annotated
corpora. On the other hand, knowledge-based systems have greater flexibility to
be applied to any domain. Our proposal is a hybrid system (TERSEO) that takes
advantage of both kinds of systems. TERSEO has a knowledge database, but
this database has been extended using an automatic acquisition of new rules for
other languages [7]. That is why TERSEO is able to work at a multilingual level.
However, in this article we focus on the event ordering method based on the
TERSEO system. The Event Ordering System is described in the following section.

3 Description of the Event Ordering System 

The system proposed for event ordering is represented graphically in Figure 1.
Temporal information is detected in two steps. First, temporal expressions are
obtained by the Temporal Expression Detection Unit. After that, the Temporal
Signal Detection Unit returns all the temporal signals. The Temporal Expressions
(TEs) that have been recognized are passed to the resolution unit, which updates
the value of the reference (initially the document’s date) according to the date
it refers to and generates XML tags for each expression. These tags are part of
the input of the event ordering unit. The temporal signals that have been
obtained are passed to a unit that obtains the ordering key for each temporal
signal. The ordering key establishes the order between two events and is also
used by the event ordering unit. With all this information, the event ordering
unit is able to return the ordered text. Temporal expressions and temporal
signals are explained in detail below.




Fig. 1. Graphic representation of the Event Ordering System



3.1 Temporal Information Detection 

The Temporal Information Detection Unit is divided into two main steps:

— Temporal Expressions Detection

— Temporal Signal Detection

Both steps are fully explained in the following sections, but they share a
common preprocessing of the texts. Texts are tagged with lexical and
morphological information by a POS tagger, and this information is the input to
a temporal parser. This temporal parser is implemented using a bottom-up
technique (chart parser) and is based on a temporal grammar [9].



Temporal Expressions Detection. One of the main tasks involved in recognizing
and resolving temporal expressions is classifying them, because the way an
expression is resolved depends on its type. In this paper, two classifications
of temporal expressions are presented. The first is based on the kind of
reference and focuses on recognizing the kind of expression when it enters the
system and needs to be resolved. The second focuses on the kind of output
returned by the system for that type of expression.

— Classification of the expression based on the kind of reference

  • Explicit Temporal Expressions.

    * Complete dates with or without time expressions: “11/01/2002”
      (01/11/2002), “el 4 de enero de 2002” (January 4th, 2002),...

    * Dates of events:

      ■ Noun phrase with explicit date: “el curso 2002-2003” (2002-2003
        course). In this expression, “course” denotes an event.

      ■ Noun phrase with a well-known date: “Navidad” (Christmas),...

  • Implicit Temporal Expressions.

    * Expressions that refer to the document date:

      ■ Adverbs or adverbial phrases: “ayer” (yesterday),...

      ■ Noun phrases: “el próximo mes” (the next month),...

      ■ Prepositional phrases: “en el mes pasado” (in the last month),...

    * Expressions that refer to another date:

      ■ Adverbial phrases: “durante el curso” (during the course),...

      ■ Noun phrases: “un mes después” (a month later), “después de la
        próxima Navidad” (after next Christmas),... For example, with the
        expression “after next Christmas” it is necessary to resolve the TE
        “next Christmas” and then apply the changes that the word “after”
        makes to the date obtained.

      ■ Prepositional phrases: “desde Navidad” (from Christmas), “desde la
        anterior Navidad” (since last Christmas),...

— Classification by the representation of the temporal value of the
expression

  • Concrete. Expressions that return a concrete day and/or time in the
    format dd/mm/yyyy (hh:mm:ss) (mm/dd/yyyy (hh:mm:ss)), for example:
    “ayer” (yesterday).

  • Period. Expressions that return a time interval or range of dates:
    [dd/mm/yyyy-dd/mm/yyyy] ([mm/dd/yyyy-mm/dd/yyyy]), for example:
    “durante los siguientes cinco días” (during the five following days).

  • Fuzzy. Expressions that return an approximate time interval because the
    concrete date that the expression refers to is not known. There are two
    types:

    * Fuzzy concrete. The result is an interval, but the expression refers to
      a concrete day within that interval that cannot be determined exactly.
      For that reason an approximation of the interval is returned, for
      example: “un día de la semana pasada” (a day of the last week),...

    * Fuzzy period. The expression refers to an interval contained within the
      given interval, for instance: “hace unos días” (some days ago),
      “durante semanas” (for weeks),...

In Section 5, we will see how the system is able to resolve a large part of
these temporal expressions, which have been recognized by the Temporal
Expression Unit.
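The classification by temporal value described above can be sketched as a small
data type. This is only an illustrative sketch, not the TERSEO implementation:
the names ValueKind and TemporalValue, and the example dates (computed by hand
for a hypothetical document date of 30/12/2002), are assumptions introduced
here.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class ValueKind(Enum):
    """The four output categories named in the classification."""
    CONCRETE = "concrete"
    PERIOD = "period"
    FUZZY_CONCRETE = "fuzzy concrete"
    FUZZY_PERIOD = "fuzzy period"


@dataclass(frozen=True)
class TemporalValue:
    """Hypothetical container for a resolved temporal expression."""
    kind: ValueKind
    start: date
    end: Optional[date] = None  # None when the value is a single day

    def render(self) -> str:
        """Render as dd/mm/yyyy or [dd/mm/yyyy-dd/mm/yyyy]."""
        fmt = "%d/%m/%Y"
        if self.end is None:
            return self.start.strftime(fmt)
        return f"[{self.start.strftime(fmt)}-{self.end.strftime(fmt)}]"


# "ayer" (yesterday) relative to the assumed document date 30/12/2002.
yesterday = TemporalValue(ValueKind.CONCRETE, date(2002, 12, 29))

# "un día de la semana pasada": one unknown day inside the previous week,
# so the whole week is returned as an approximation.
last_week_day = TemporalValue(
    ValueKind.FUZZY_CONCRETE, date(2002, 12, 23), date(2002, 12, 29)
)
```

Keeping the kind separate from the date range lets a fuzzy value and a period
share the same interval notation while remaining distinguishable downstream.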



Temporal Signals Detection. Temporal signals relate the different events in a
text and establish a chronological order between them. A set of temporal
signals has been obtained empirically from the study of a training corpus. Some
of them are: después (after), cuando (when), antes (before), durante (during),
previamente (previously), desde ... hasta ... (from ... to ...), en (on, in),
mientras (while), por (for), en el momento de (at the time of), desde (since),
etc.
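As a rough illustration of signal detection, the list above can be matched
against a sentence with a plain lexicon lookup. This is a simplification and an
assumption on our part: the system described here works on POS-tagged text with
a chart parser, whereas the hypothetical find_signals function below only does
whole-word string matching, and it omits the discontinuous signal
“desde ... hasta ...”.

```python
import re

# Spanish temporal signals listed in the text, with their English glosses.
SIGNALS = {
    "después": "after",
    "cuando": "when",
    "antes": "before",
    "durante": "during",
    "previamente": "previously",
    "en": "on/in",
    "mientras": "while",
    "por": "for",
    "en el momento de": "at the time of",
    "desde": "since",
}


def find_signals(sentence):
    """Return (position, signal, gloss) for each signal found, left to right."""
    lowered = sentence.lower()
    covered = set()  # character positions already claimed by a longer signal
    hits = []
    # Try longer signals first so "en el momento de" wins over plain "en".
    for signal in sorted(SIGNALS, key=len, reverse=True):
        pattern = r"\b" + re.escape(signal) + r"\b"
        for match in re.finditer(pattern, lowered):
            span = range(match.start(), match.end())
            if covered.isdisjoint(span):
                covered.update(span)
                hits.append((match.start(), signal, SIGNALS[signal]))
    return sorted(hits)
```

Matching longest signals first is a design choice of this sketch; it avoids
reporting a short signal that is merely part of a longer one.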



3.2 Temporal Expression Coreference Resolution 

Temporal Expression Coreference Resolution is organized in two different tasks: 

— Anaphoric relation resolution based on a temporal model 

— Tagging of temporal Expressions 

Every task is explained next. 



Anaphoric relation resolution based on a temporal model. For the anaphoric
relation resolution we use an inference engine that interprets every reference
named before. In some cases the references are estimated using the document’s
date (FechaP). Others refer to a date named earlier in the text that is being
analyzed (FechaA). For these cases, a temporal model is defined that determines
the date on which the dictionary operations are performed. This model is based
on the two rules below and is only applicable to dates other than FechaP, since
for FechaP there is nothing to resolve:

1. By default, the newspaper’s date is used as the base referent (TE) if it
exists; if not, the system date is used.

2. If a non-anaphoric TE is found, it is stored as FechaA. This value is
updated every time a non-anaphoric TE appears in the text.
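The two rules above can be sketched as a small state holder. This is a minimal
sketch under stated assumptions: the class name TemporalModel and its methods
are hypothetical, and falling back to FechaP when no FechaA has been seen yet
is our own choice, which the text does not specify.

```python
from datetime import date
from typing import Optional


class TemporalModel:
    """Tracks FechaP (document date) and FechaA (last non-anaphoric TE)."""

    def __init__(self, document_date: Optional[date] = None):
        # Rule 1: use the newspaper's date if it exists, else the system date.
        self.fecha_p = document_date if document_date else date.today()
        self.fecha_a: Optional[date] = None

    def observe(self, resolved: date, anaphoric: bool) -> None:
        # Rule 2: every non-anaphoric TE replaces the stored FechaA.
        if not anaphoric:
            self.fecha_a = resolved

    def base_for(self, refers_to_document: bool) -> date:
        """Date on which a dictionary operation should be applied."""
        if refers_to_document or self.fecha_a is None:
            return self.fecha_p  # assumption: fall back to FechaP
        return self.fecha_a
```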

Table 1¹ shows some of the entries in the dictionary used by the inference
engine. The unit that estimates the dates accesses the right entry in the
dictionary in each case and applies the specified function, obtaining a date in
the format dd/mm/yyyy (mm/dd/yyyy) or a range of dates. At that point the
anaphoric relation has been resolved.

¹ The operation ‘+1’ in the dictionary interprets the dates so that a valid
date is returned. For example, if the Month(date) function returns 12 and the
operation ‘+1’ is applied to that value, the value returned is 01 and the year
is increased by one.







Table 1. Sample of some of the entries in the dictionary

REFERENCE                          DICTIONARY ENTRY
---------------------------------  ------------------------------------------
‘ayer’ (yesterday)                 Day(FechaP)-1/Month(FechaP)/Year(FechaP)
‘mañana’ (tomorrow)                Day(FechaP)+1/Month(FechaP)/Year(FechaP)
‘durante el mes siguiente’         [DayI/Month(FechaA)+1/Year(FechaA) -
(during the following month)        DayF/Month(FechaA)+1/Year(FechaA)]
num+‘años siguientes’              [01/01/Year(FechaA)+num -
(num years later)                   31/12/Year(FechaA)+num]
‘un día antes’ (a day before)      Day(FechaA)-1/Month(FechaA)/Year(FechaA)
‘días después’ (some days later)   >>>>FechaA
‘días antes’ (some days before)    <<<<FechaA
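A few of the dictionary entries in Table 1, together with the ‘+1’ rollover
described in the footnote, can be sketched as ordinary date functions. The
function names below are hypothetical, and reading DayI and DayF as the first
and last day of the month is our interpretation of the table.

```python
import calendar
from datetime import date, timedelta


def add_months(year, month, delta):
    """Month arithmetic with rollover: month 12 plus 1 gives 01 of next year."""
    index = year * 12 + (month - 1) + delta
    return index // 12, index % 12 + 1


def ayer(fecha_p):
    """'ayer' (yesterday): one day before the document date."""
    return fecha_p - timedelta(days=1)


def manana(fecha_p):
    """'mañana' (tomorrow): one day after the document date."""
    return fecha_p + timedelta(days=1)


def durante_el_mes_siguiente(fecha_a):
    """'durante el mes siguiente': the whole month after FechaA, as a range."""
    year, month = add_months(fecha_a.year, fecha_a.month, 1)
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


def anos_siguientes(fecha_a, num):
    """num + 'años siguientes': the whole year num years after FechaA."""
    year = fecha_a.year + num
    return date(year, 1, 1), date(year, 12, 31)
```

Using timedelta for day arithmetic means that day-level rollover across month
and year boundaries is handled by the standard library rather than by hand.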



Tagging of temporal expressions. Several proposals for the annotation of TEs
have arisen in the last few years, e.g. Wilson et al. [13], Katz and Arosio [4]
and TIMEML [6], since some research institutions have started to work on
different aspects of temporal information. In this section, our own set of XML
tags is defined in order to standardize the different kinds of TEs. We have
defined a simple set of tags and attributes that meets the needs of our system
without complicating it. Besides, they could be transformed into other existing
formats, such as TIMEML, at any time. The tags have the following structure:

— For Explicit Dates:

<DATE_TIME ID="value" TYPE="value" VALDATE1="value"
VALTIME1="value" VALDATE2="value" VALTIME2="value"
VALORDER="value">expression</DATE_TIME>

— For Implicit Dates:

<DATE_TIME_REF ID="value" TYPE="value" VALDATE1="value"
VALTIME1="value" VALDATE2="value" VALTIME2="value"
VALORDER="value">expression</DATE_TIME_REF>

DATE_TIME is the name of the tag for explicit TEs and DATE_TIME_REF is the name
of the tag for implicit TEs. Every expression has a numeric ID that identifies
it, and VALDATE# and VALTIME# store the range of dates and times obtained from
the inference engine; VALDATE2 and VALTIME2 are only used to establish ranges.
VALTIME1 can be omitted if only a date is specified. VALDATE2, VALTIME1 and
VALTIME2 are optional attributes. VALORDER is the attribute in which the event
ordering unit will specify the ordering value; initially this attribute has no
value. After tagging, a structured document is obtained. The use of XML allows
us to take advantage of the XML schema in which the tag language is defined.
This schema lets an application check whether the XML file is valid and
well-formed. A parser for our XML needs to be defined to make the information
useful.
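As an illustration of how such tags might be produced and read back, the sketch
below uses Python’s standard xml.etree.ElementTree module. The helper
make_te_tag is hypothetical, and the TYPE value "concrete" is an assumption,
since the vocabulary of the TYPE attribute is not listed here.

```python
import xml.etree.ElementTree as ET


def make_te_tag(expression, te_id, te_type, valdate1, implicit,
                valtime1=None, valdate2=None, valtime2=None):
    """Build a DATE_TIME (explicit) or DATE_TIME_REF (implicit) element."""
    tag = ET.Element("DATE_TIME_REF" if implicit else "DATE_TIME")
    tag.set("ID", str(te_id))
    tag.set("TYPE", te_type)
    tag.set("VALDATE1", valdate1)
    # VALTIME1, VALDATE2 and VALTIME2 are optional and written only if given.
    optional = (("VALTIME1", valtime1), ("VALDATE2", valdate2),
                ("VALTIME2", valtime2))
    for name, value in optional:
        if value is not None:
            tag.set(name, value)
    tag.set("VALORDER", "")  # filled in later by the event ordering unit
    tag.text = expression
    return tag


element = make_te_tag("in December 1", 1, "concrete", "12/01/2002",
                      implicit=True)
xml_text = ET.tostring(element, encoding="unicode")

# Reading the tag back, as a consumer of the annotated document would.
parsed = ET.fromstring(xml_text)
```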
3.3 Ordering Keys Obtaining

The temporal signals obtained by the Temporal Signal Detection Unit are used
by this unit to obtain the ordering keys. The study of the corpus revealed a
set of temporal signals. Each temporal signal denotes a relationship between
the dates of the events that it relates. For example, in EV1 S EV2, the signal
S denotes a relationship between EV1 and EV2. Assuming that F1 is the date
related to the first event and F2 is the date related to the second event, the
signal will establish a certain order between these events. This order will be
established by the ordering unit. Some of the ordering keys, which are the
output of this unit, are shown in Table 2.



Table 2. Output of the Ordering Key Obtaining Unit

SIGNAL           ORDERING KEY
---------------  ----------------
After            F1 > F2
When             F1 = F2
Before           F1 < F2
During           F2i <= F1 <= F2f
Previously       F1 > F2
From F2 to F3    F2 <= F1 <= F3
About F2 - F3    F2 <= F1 <= F3
On / in          F1 = F2
While            F2i <= F1 <= F2f
For              F2i <= F1 <= F2f
At the time of   F1 = F2
Since            F1 > F2
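The ordering keys in Table 2 can be read as constraints on the dates of the two
events. The sketch below encodes the two-date signals as a lookup table and
checks a key against concrete dates. The function name satisfies is
hypothetical, and the three-date signals (“From F2 to F3”, “About F2 - F3”) are
left out to keep the example short.

```python
from datetime import date

# Signal -> ordering key, copied from Table 2 (two-date signals only).
ORDERING_KEYS = {
    "after": "F1 > F2",
    "when": "F1 = F2",
    "before": "F1 < F2",
    "during": "F2i <= F1 <= F2f",
    "previously": "F1 > F2",
    "on": "F1 = F2",
    "in": "F1 = F2",
    "while": "F2i <= F1 <= F2f",
    "for": "F2i <= F1 <= F2f",
    "at the time of": "F1 = F2",
    "since": "F1 > F2",
}


def satisfies(signal, f1, f2, f2_end=None):
    """Check whether dates f1 and f2 respect the ordering key of a signal.

    For interval keys, f2 is the start (F2i) and f2_end the end (F2f).
    """
    key = ORDERING_KEYS[signal.lower()]
    if key == "F1 > F2":
        return f1 > f2
    if key == "F1 < F2":
        return f1 < f2
    if key == "F1 = F2":
        return f1 == f2
    if f2_end is None:
        raise ValueError("interval keys need both F2i and F2f")
    return f2 <= f1 <= f2_end
```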



3.4 Event Ordering Method 

Event ordering in natural language texts is not a trivial task. First, events
must be identified. Then, the relationship between the events, or between an
event and the date when it occurs, must be identified. Finally, the ordering of
events must be determined according to their temporal information. This
temporal information can be dates, temporal expressions or temporal signals. We
have simplified the task of identifying events: an event is identified only as
a sentence that includes some kind of TE or a sentence that is related to
another sentence by a temporal signal.

Using the XML tags and the ordering keys, the event ordering module runs over
the text, building a table that specifies the order and the date, if there is
any, of every event. The order is established according to the following rules:

1. EV1 is previous to EV2:

   — if the range of VALDATE1, VALTIME1, VALDATE2, VALTIME2 associated with
     EV1 is prior to and does not overlap the range associated with EV2,

   — or, if the ordering key that relates both events is:

     EV1<EV2

2. EV1 is concurrent to EV2:

   — if the range of VALDATE1, VALTIME1, VALDATE2, VALTIME2 associated with
     EV1 overlaps the range associated with EV2,

   — or, if the ordering key that relates both events is:

     EV1=EV2 or EV1i<=EV2<=EV1f

The system assigns a sequential order number to every event in the table,
giving the same order number to concurrent events.

An example is shown in Figure 2. Newspaper’s date: 30/12/2002

[Figure 2 content]
Text: “In December 1, the French bathyscaphe Nautilus arrives at the Galician
coast, previously there were some cracks.”
Temporal Information Detection: temporal signal “previously”; temporal
expression “In December 1”
Ordering Key Obtaining: ordering key event 1 > event 2
Temporal Expression Coreference Resolution: TE tag
<DATE_TIME_REF VALDATE1="12/01/2002">in December 1</DATE_TIME_REF>
Event Ordering:

Order  Event                                          Date
-----  ---------------------------------------------  ---------------
1      There were some cracks                         <<<< 12/01/2002
2      The French bathyscaphe Nautilus arrives at     12/01/2002
       the Galician coast

Fig. 2. Graphical example of Event Ordering
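The date-range part of the rules above can be sketched as follows. This is a
minimal sketch, not the authors’ implementation: the Event class and
order_events function are hypothetical, the ordering-key branch of the rules is
omitted, and grouping overlapping ranges by sweeping over the events sorted by
start date is our own assumption about how concurrency is propagated.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    text: str
    start: date  # VALDATE1
    end: date    # VALDATE2, equal to start for a single day


def order_events(events: List[Event]) -> List[Tuple[int, Event]]:
    """Assign order numbers; events with overlapping ranges share a number."""
    ordered = sorted(events, key=lambda e: (e.start, e.end))
    result = []
    order = 0
    group_end = None  # latest end date seen in the current concurrent group
    for event in ordered:
        if group_end is None or event.start > group_end:
            # Rule 1: the previous group is prior to and not overlapping.
            order += 1
            group_end = event.end
        else:
            # Rule 2: ranges overlap, so the event is concurrent.
            group_end = max(group_end, event.end)
        result.append((order, event))
    return result
```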

4 Application of Event Ordering in NLP Tasks 

Event Ordering can be applied to different tasks in the field of Natural
Language Processing, for example Summarization and Question Answering. In
particular, we have developed a method to apply Event Ordering to Temporal
Question Answering [8].
Temporal Question Answering is not a trivial task because temporal questions
can be highly complex. Current Question Answering systems can deal with simple
questions requiring a date as answer or questions that use explicit temporal
expressions in their formulation. Nevertheless, more complex questions
referring to the temporal properties of the entities being questioned and
their relative temporal ordering in the question are beyond the scope of
current Question Answering systems. These complex questions consist of two or
more events, related by a temporal signal, which establishes the order between
the events in the question. This allows us to divide the complex question,
using Temporal Signals, into simple ones that current Question Answering
systems are able to resolve, and to recompose the answer using the ordering key
to obtain a final answer to the complex question.

An example of how the system works with the question “Where did Bill Clinton
study before going to Oxford University?” is shown here:

1. First of all, the unit recognizes the temporal signal, which in this case is
before.

2. Secondly, the complex question is divided into simple ones.

— Q1: Where did Bill Clinton study?

— Q2: When did Bill Clinton go to Oxford University?

3. A general purpose Question Answering system answers the simple questions,
obtaining the following results:

— Answer for Question 1: Georgetown University (1964-1968)

— Answer for Question 1: Oxford University (1968-1970)

— Answer for Question 1: Yale Law School (1970-1973)

— Answer for Question 2: 1968

4. All answers that do not fulfill the constraint established by the ordering
key are rejected.

5. After that, the final answer to the complex question is Georgetown
University.
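Steps 4 and 5 of the example above can be sketched as a filter over the
candidate answers. The answers and years are those listed in step 3. Comparing
the start year of each answer against the date returned for Q2 is an assumption
of this sketch, and the function name answers_before is hypothetical.

```python
# Candidate answers to Q1 with their year ranges, as listed in step 3.
Q1_ANSWERS = [
    ("Georgetown University", 1964, 1968),
    ("Oxford University", 1968, 1970),
    ("Yale Law School", 1970, 1973),
]
Q2_ANSWER = 1968  # answer to "When did Bill Clinton go to Oxford University?"


def answers_before(candidates, reference_year):
    """Keep answers satisfying the key of 'before' (F1 < F2).

    F1 is taken to be the start year of each candidate (an assumption).
    """
    return [name for name, start, _end in candidates if start < reference_year]


final_answers = answers_before(Q1_ANSWERS, Q2_ANSWER)
```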

5 System Evaluation 

In order to evaluate this system, texts were manually annotated by two
annotators so that the manual annotation could be compared with the automatic
annotation produced by the system. It is therefore necessary to confirm that
the manual annotation is trustworthy and does not alter the results of the
experiment. Carletta [2] explains that, to ensure a good annotation, it is
necessary to make a series of direct measurements (stability, reproducibility
and precision), and that in addition to these measurements the reliability must
measure the amount of noise in the information. The author argues that, because
the amount of agreement that can be expected by chance depends on the number
and relative frequencies of the categories under test, the reliability of
category classifications should be measured using the kappa factor defined in
Siegel and Castellan [12]. The kappa factor (k) measures the agreement among a
set of annotators when they make category judgments.

In our case, there is only one class of objects and there are three objects
within this class: objects that refer to the date of the article, objects that
refer to the previous date, and objects that refer to a date different from
the previous ones.

After carrying out the calculation, a value of k=0.953 was obtained. According
to Carletta [2], a measurement of k such that 0.68 < k < 0.8 means that the
conclusions are favorable, and k > 0.8 means that total reliability exists
between the results of both annotators. Since our value of k is greater than
0.8, the conducted annotation can be considered totally reliable and,
therefore, the precision and recall results obtained are dependable.
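For readers who want to reproduce this kind of check, the kappa statistic for
two annotators can be sketched as below, using k = (P(A) - P(E)) / (1 - P(E)),
where P(A) is the observed agreement and P(E) the agreement expected by chance.
Estimating P(E) from the category proportions pooled over both annotators is
our assumption about the variant used, and the labels in the usage note are
invented, not the annotations of this experiment.

```python
from collections import Counter


def kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(labels_a)
    # P(A): proportion of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # P(E): chance agreement from category proportions pooled over annotators.
    pooled = Counter(labels_a) + Counter(labels_b)
    expected = sum((count / (2 * n)) ** 2 for count in pooled.values())
    if expected == 1:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1 - expected)
```

With the three categories used here (article date, previous date, other date),
two annotators who agree on every item obtain k = 1, while agreement at chance
level gives k = 0.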

In order to evaluate the event ordering method, the TERSEO system was evaluated
at a monolingual level (Spanish). Establishing a correct order between the
events implies that the resolution is correct and that the events are placed on
a timeline, as shown in Figure 3. For this reason, we have evaluated the
resolution of the Temporal Expressions.



[Figure 3: events and their temporal expressions (EVENT 1: Jan. 1, 1967;
EVENT 2: a year later; EVENT 3: two months before) placed on a timeline in the
order EV1 (01/01/1967), EV3 (10/01/1967), EV2 (01/01/1968).]

Fig. 3. Event Ordering based on TE resolution



Two corpora of newspaper articles in Spanish were used. The first set was used
for training and consists of 50 articles. After making the appropriate
adjustments to the system, the best precision and recall results obtained are
shown in Table 3.

Table 3. Evaluation of the system

               TRAINING   TEST
-------------  ---------  -----
No. Art.       50         50
Real Ref.      238        199
Treated Ref.   201        156
Successes      170        138
Precision      84%        91%
Recall         71%        73%
Coverage       84%        80%

Although the results obtained are highly satisfactory, we have detected some
failures, which have been analyzed in depth. As the results show, our system
could be improved in some respects. The problems detected and their possible
improvements are discussed below:

— In newspaper articles there are sometimes expressions like “el sábado hubo
cinco accidentes” (Saturday there were five accidents). To resolve this kind
of reference we need context information from the sentence where the
reference occurs. That information could be the tense of the sentence’s verb.
If the verb is in the past tense, the reference must be resolved like “el
sábado pasado” (last Saturday), whereas if it is in the future tense it
refers to “el sábado próximo” (next Saturday). Because our system does not
use semantic or context information, we assume that this kind of reference
refers to the last such day, not the next, because news usually reports
facts that occurred previously.

— Our system is not able to resolve temporal expressions that contain a
well-known event, for instance: “two days before the war...”. To resolve this
kind of expression, some extra knowledge of the world is necessary, and we do
not currently have access to this kind of information.

6 Conclusions 

In this article a method of event ordering has been presented. The method is
based on the detection of keywords (temporal signals) and the resolution of
the temporal information associated with the event. It can be applied to
multilingual texts because the TERSEO system resolves the temporal expressions
in this type of text. The results obtained show that this tool can be used to
improve other NLP systems, for example Question Answering systems handling
questions about temporal information, or Summarization systems. An application
of this work is currently being used in a Question Answering system [8].

As future work, two tasks will be considered:

— The system will cope with the resolution of temporal expressions considering
context information or world knowledge.

— An evaluation of the TERSEO system at a multilingual level is being prepared.
Abstract. Many queries processed on the World Wide Web do not return the
desired results because they fail to take into account the context of the query
and information about the user’s situation and preferences. In this research,
we propose the use of user profiles as a way to increase the accuracy of web
pages returned from the Web. A methodology for creating, representing, and
using user profiles is proposed. A frame-based representation captures the
initial user’s profile. The user’s query history and post-query analysis are
intended to further update and augment the user’s profile. The effectiveness of
the approach for creating and using user profiles is demonstrated by testing
various queries.



1 Introduction 

The continued growth of the World Wide Web has made the retrieval of relevant
information for a user’s query difficult. Search engines often return a large
number of results when only a few are desired. Alternatively, they may come up
“empty” because a small piece of information is missing. Most search engines
operate on a syntactic basis, and cannot assess the usefulness of a query as a
human would who understands his or her own preferences and has common sense
knowledge of the real world. The problems associated with query processing, in
essence, arise because context is missing from the specification of a query.
The Semantic Web has been proposed to resolve context problems by documenting
and using semantics [1]. This research investigates how user profiles can be
used to improve web searches by incorporating context during query processing.
The objectives are to: 1) develop a heuristics-based methodology to capture,
represent, and use user profiles; 2) incorporate the methodology into a
prototype, and 3) test the effectiveness of the methodology. The contribution
of the research is to help realize the Semantic Web by capturing and using the
semantics of a query through user profiling. Our research contributes, not
through a new approach to building, representing, or using a user profile, but
by showing how user profiles can be used as part of an integrated approach that
utilizes lexical, ontological, and personal knowledge.



2 Related Research 

Information retrieval involves four steps: 1) understanding the task, 2) identifying the 
information need, 3) creating the query, and 4) executing the query [2]. This research 
focuses on the third step, namely, query creation. 



[Figure 1: diagram with the labels “Scope of research”, “User Interface”,
“Index” and “Document Collection”.]

Fig. 1. High-level Architecture and Process of a Web Search

In Figure 1 a user sends a query to a search engine using keywords and syntactical 
operations. The search engine is a black box: a user submits a query and the search 
engine lists the results. This search engine remains a black box in this research as it 
focuses on refining the query before it reaches the interface. Refinement is important 
because users submit short queries that return many irrelevant results [12]. A major 
challenge is contextual retrieval: combining search technologies and contextual 
knowledge to provide the most appropriate answer. One obstacle to contextual re- 
trieval is the lack of intelligence in Web-search systems. The Semantic Web is impor- 
tant in achieving contextual retrieval because terms on Web pages will be marked up 
using ontologies that define each term’s meaning. It will be a long time, however, 
before markup on pages becomes the norm. Thus, this research provides a methodol- 
ogy for using contextual knowledge to improve query refinement to increase the 
relevance of results on the current and Semantic Web. 

Context is: (1) the parts of a discourse that surround a word/passage and can
shed light on its meaning, and (2) the interrelated conditions in which
something exists (Merriam-Webster). For web queries, the first definition
refers to a query’s lexical context; the second to the user submitting the
query. By considering both, we aim to achieve a fully contextualized query. A
query is relevant if it satisfies a user’s information need (Borlund, 2003).
Thus, a contextualized query returns relevant results by accounting for (a) the
meaning of query terms and (b) the user’s preferences. An optimal contextual
query will minimize the distance between the information need, $I$, and the
query, $Q$. Distance ($I \rightarrow Q$) is minimized by
$\min(D_c, D_p, D_L)$, where:

• $D_c$ = use of the wrong concepts in the query to represent the information need

• $D_p$ = lack of preferences in the query to constrain the concepts requested

• $D_L$ = lack of precision in the language used in the query terms

Why would a user write a query with positive levels of $D_c$, $D_p$, $D_L$? We
postulate:

• Postulate 1: $D_c$ and $D_p$ will be higher when a user lacks domain knowledge.

• Postulate 2: $D_p$ will be higher when a user lacks knowledge of preferences.

• Postulate 3: $D_L$ will be higher when a user lacks lexical knowledge.

Prior research suggests three techniques for minimizing $D_c$, $D_p$, $D_L$:

• Ontologies: Ontologies consist of terms, their definitions, and axioms
relating them [7]. They can minimize $D_c$ by helping users understand the
relationships between concepts. For example, to find a job as a Professor, an
ontology might suggest relevant related terms, such as teaching and research.

• Lexicons: Lexicons comprise the general vocabulary of a language. Lexicons
can be used to minimize $D_L$ by identifying a term’s meaning by its
connection to other terms and by selecting terms with minimal ambiguity. For
example, in finding out about being a Professor, a lexicon might suggest that
the user avoid using terms such as ‘Chair’ that have multiple meanings.

• User Profiles: User profiles are a way of stating preferences about a
concept. They can minimize $D_p$ by serving as a constraint on the range of
instances that will be retrieved by the query. For example, a profile could
limit one’s query for Professorial jobs to positions in the United Kingdom.

Our prior research developed a Semantic Retrieval System (SRS) [3] (Figure 2) that 
uses lexical sources from WordNet and ontological sources from the DARPA ontol- 
ogy library to parse natural language queries and create contextualized queries. Given 
a set of terms, the system expands the query using lexically and domain-related terms 
to contextualize the query. The query terms form root nodes in a semantic network 
that expands by adding related terms and shrinks by removing terms of unwanted 
context, iterating towards an effective query. The system does not include user
profiles, so it cannot create fully contextualized queries. This research
attempts to achieve this by capturing users’ preferences via a user profile
module.




Fig. 2. Query Expansion and Semantic Web Retrieval Architecture 



3 Integrating User Profiles into Query Contextualization 

Consider the query in Table 1. To retrieve relevant results, much personal
information is needed; merely parsing for key words will not be sufficient.
Given that most users have preferences when querying (per Table 1), it may be
unclear why a user would exclude them from a query. Postulate 2 suggests two
reasons:
Table 1. Sample Query (adapted from [1])

Natural Language Query                   Personal Information Needed
---------------------------------------  ------------------------------------------
Mom needs to have a series of physical   Prescribed treatment. Preferred providers.
therapy sessions. Biweekly or            Insurance plan. Preferred distance from
something.... Set up the appointments.   home. Preferred quality of service.
                                         Preferred appointment times.



1. Lack of domain knowledge: the user lacks knowledge of the domain and, thus, 
about possible preferences (e.g., the user does not know that physiotherapists are 
covered by insurance, and so does not include her/his insurance preferences). 

2. Delegated query: the user (or agent) is submitting a query on behalf of another 
user (e.g., ‘Mom’ in Table 1) and does not know that user’s preferences. 

Considering postulates 1-3 together yields a decision matrix (Table 2).

Table 2. Applicability of Contextual Knowledge Source for Search Query*

                Do Query Yourself                    Delegated Query
--------------  -----------------------------------  -----------------------------------
Domain Expert   (1) Do not use ontology              (2) Do not use ontology
                Use lexicon if uncertain/ambiguous   Use lexicon if uncertain/ambiguous
                Do not use personal profile          Use user profile (personal)

Domain Novice   (3) Use ontology                     (4) Use ontology
                Use lexicon if uncertain/ambiguous   Use lexicon if uncertain/ambiguous
                Use user profile (stereotype)        Use user profile (stereotype &
                                                     personal)

* For simplicity, lexical knowledge is not shown as a third dimension

The Semantic Retrieval System considers the conditions in Table 2 to determine
what contextual knowledge to invoke. A user can manually select a level for
each variable (domain expertise, query writer, and language certainty) or the
system can infer a user’s level. The heuristics used for inferences are:

• Domain expertise: A user is an expert if he or she submits the same query as
another user listed in the system’s history as an expert.

• Query writer: A user is the previous user if s/he submits a query that
contains a term that is the same or one degree separated from a term in the
prior query.

• Language uncertainty: A user requires a lexicon if he or she submits a term
that has a synonym with less than or equal to half the number of word senses.

The Semantic Retrieval System uses different knowledge depending on the cell of
the matrix that best describes the user, resulting in separate query processes
and heuristics for each cell. As Table 2 shows, user profiles are only used
when a query is delegated to another user/agent, or a user is a novice.
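The language-uncertainty heuristic used with Table 2 (a lexicon is required
when a submitted term has a synonym with at most half as many word senses) can
be sketched as follows. The function needs_lexicon and the sense counts and
synonym lists in the usage note are hypothetical illustrations, not values
taken from WordNet or from the Semantic Retrieval System.

```python
def needs_lexicon(term, sense_counts, synonyms):
    """True if some synonym of term has at most half as many word senses."""
    term_senses = sense_counts[term]
    return any(
        sense_counts[synonym] <= term_senses / 2
        for synonym in synonyms.get(term, ())
    )


# Hypothetical sense counts and synonym sets, for illustration only.
SENSE_COUNTS = {"chair": 4, "professorship": 1}
SYNONYMS = {"chair": ["professorship"], "professorship": ["chair"]}
```

With these invented counts, the ambiguous term “chair” triggers the lexicon
because “professorship” has far fewer senses, while “professorship” itself
does not.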



3.1 Type of User Profile 

Researchers have been investigating how profiles could help queries for many years [9], 
but there remains no consensus on how to build, represent, or use a user profile [6], and 
the benefits have been inconsistent [10]. Table 3 summarizes the most well known ap- 
proaches and the approach that we took with the Semantic Retrieval System. 
Table 3. Summary of User Profile Techniques

Aspect of User      Approaches in Literature                  Example     Approach in SRS
Profile Research                                              Reference
------------------  ----------------------------------------  ----------  -----------------
Types of user       Knowledge-based vs. behavior-based        [11]        Knowledge-based
profiles
                    Personal vs. stereotype vs. community     [10]        Personal and
                                                                          stereotype

Approaches for      Direct/explicit (need user interaction)   [13], [4]   Direct and
creating user       vs. Indirect/implicit (infer from user                indirect
profiles            behavior)

Approaches for      Keyword-based (e.g., vector) vs.          [10]        Knowledge-based
representing user   Knowledge-based (e.g., rules, cases,                  (frames)
profiles            frames, graph, tables)

Approaches for      Statistical (e.g., cosine, Bayes) vs.     [8]         Knowledge-
using user          Knowledge inference (e.g., rules,                     inference (rules)
profiles            neural nets)



User profiles are typically knowledge-based or behavior-based [11]. Knowledge-based
profiles reflect a user’s knowledge in the form of semantics; behavior-based profiles
store records of users’ actions, e.g., web sites visited. We use a knowledge-based
approach because our objective is to reach a knowledgeable decision about the
context of a user’s query. Another distinction among profiles is whether the
preferences are: 1) personal (i.e., individual); 2) stereotype (i.e., held by a class of
individuals); or 3) community (i.e., held by an entire community) [10]. This research
uses personal and stereotype approaches only, because community preferences are less
likely to be useful for a specific user and context. The Semantic Retrieval System
chooses a suitable profile based on a user’s level of domain knowledge. Although
stereotypes can derive from any personal attribute, domain knowledge is a key one.
Using domain knowledge requires the system to identify: (a) the domain(s) a user’s
query pertains to, and (b) the user’s level of knowledge about the domain(s). As
Table 2 shows, the system uses the following rules:

• If the user has a high level of domain expertise: 

o If the user writes her/his own query, do not use any profile, 
o If the user has an agent write the query, use the individual’s personal profile, 

• If the user has a low level of domain expertise: 

o If the user writes her/his own query, use the ‘expert’ stereotype profile, 
o If the user has an agent write the query, use the user’s personal profile together 
with the ‘expert’ stereotype profile. 
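
The four default rules above amount to a small lookup table. The following Python sketch is illustrative only: the names DEFAULT_RULES and select_profiles and the string labels are assumptions made for this example and are not taken from the prototype.

```python
# Default profile-selection rules keyed by (domain expertise, query is delegated).
# All names and labels here are illustrative assumptions, not the prototype's code.
DEFAULT_RULES = {
    ("high", False): (),                                # expert writes own query
    ("high", True): ("personal",),                      # expert delegates
    ("low", False): ("expert stereotype",),             # novice writes own query
    ("low", True): ("personal", "expert stereotype"),   # novice delegates
}


def select_profiles(expertise, delegated, rules=None):
    """Return the profiles to consult; `rules` overrides individual default cells."""
    table = dict(DEFAULT_RULES)
    if rules:
        table.update(rules)
    return table[(expertise, delegated)]


print(select_profiles("low", True))
```

Passing a mapping such as {("low", False): ("novice stereotype",)} as `rules` replaces a single default while leaving the other three cells unchanged.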

These rules first assume that users with more domain expertise have more stable
preferences for what they want. Consequently, expert users are less likely to need to
know what other users think about a topic. Second, users would rather know the
preferences of experts than those of novices, so a novice user uses the expert
stereotype rather than the novice stereotype. Our system uses these rules as defaults,
and users can change them.



3.2 User Profile Creation

The Semantic Retrieval System uses personal construct theory to guide profile
creation. This theory assumes that people use a set of basic constructs to comprehend
the world. An individual’s construct system is personal because different people
may use different constructs and may mean different things by the same constructs [5].

The system creates user profiles using both direct and indirect methods, as Table 3
shows. Because users are typically expert in some topics and novice in others, most
users will use a combination of methods to build their profiles. The direct method is
used for topics in which a person is expert, because a domain expert is more likely
to have stable preferences about a domain and to be better able to explicate them.
The first step is for the system to infer, or for the user to instruct the system, that
he or she is an expert on a topic. The system then requires the user to manually
create a profile on that topic by completing a form detailing the main constructs,
key attributes of each construct, preferences about each attribute, and preferred
instances. For example, a person might have a topic (e.g., ‘Health’), a construct
(e.g., ‘Doctor’), attributes (e.g., ‘Location’), constraints (e.g., Location should be
‘near to Atlanta’), and preferred instances (e.g., Dr Smith).

An indirect method is used to create novice profiles because the direct method is
less likely to work if users’ preferences are unstable and difficult to explicate. The
first step in the indirect method is for the system to infer a user’s topics and constructs
by having the user complete practice queries for topics of interest. To expand the set
of constructs and eliminate redundancy, the user’s query terms are queried against
WordNet and the DAML ontology library to find synonyms and hypernyms. The system
further expands the construct list by having the user rank the relevance of page
snippets returned from queries. Highly ranked snippets are parsed for additional terms
to add to the construct list. After finalizing the construct list, the system searches for
relevant properties of each construct by querying the ontology library. It also queries
existing profiles from other users to find out whether they use the same constructs and,
if so, what properties they use. The system presents the user with a final list of
properties from these sources (ontologies and existing user profiles) and asks the user
to select relevant properties, add new ones if necessary, and rate their importance.
Finally, the system requests that the user enter constraints for each property (e.g.,
Location ‘near to Atlanta’) as well as preferred instances, if any, of the construct
(e.g., Dr Smith).



3.3 User Profile Representation 

A frame representation is used to specify constructs (frames), attributes (slots), and 
constraints (slot values). There are two types of preferences as shown in Figure 3: 

• Global constraints apply to all queries (e.g. pages returned must be in English). 

• Local constraints apply based on the query context (e.g., location may be a relevant
constraint for restaurants but not for online journals).



Frame: Global Constraint                  Frame: Restaurant

Slot Name: Language                       Slot Name: Food_type
Slot Value: English                       Value: Vegetarian

Slot Name: Domain to search               Slot Name: Cuisine
Slot Value: .com                          Value: Italian

                                          Slot Name: Location
                                          Value: Atlanta

                                          Slot Name: Price
                                          Value: Inexpensive

Fig. 3. Global and Local Constraints 
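
The frame representation in Figure 3 maps naturally onto a small data structure. The sketch below is a Python illustration under stated assumptions: the class Frame and the function active_constraints are hypothetical names, and matching a local frame by comparing its name with the query context is a simplification chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A construct (frame); its attributes are slots, its constraints are slot values."""
    name: str
    slots: dict = field(default_factory=dict)


# Frame contents copied from Figure 3.
GLOBAL = Frame("Global Constraint", {"Language": "English", "Domain to search": ".com"})
RESTAURANT = Frame("Restaurant", {"Food_type": "Vegetarian", "Cuisine": "Italian",
                                  "Location": "Atlanta", "Price": "Inexpensive"})


def active_constraints(context, global_frame, local_frames):
    """Global slots always apply; a local frame applies only if it matches the context."""
    active = dict(global_frame.slots)
    for frame in local_frames:
        if frame.name.lower() == context.lower():
            active.update(frame.slots)
    return active


print(active_constraints("restaurant", GLOBAL, [RESTAURANT]))
print(active_constraints("online journal", GLOBAL, [RESTAURANT]))
```

For the context "restaurant" both the global and the local slots are returned; for "online journal" only the global constraints remain, which mirrors the distinction drawn above.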



3.4 User Profile Use 

We distinguish between informational and action queries. The query “Mom needs to
have a series of physical therapy sessions. Biweekly or something.... Set up the
appointment” is an action query because it asks the ‘agent’ to ‘set up’ appointments.
Our research is restricted to informational queries that simply retrieve information.
Recall, from Table 2, that the Semantic Retrieval System follows one of four approaches
depending on a user’s level of domain and lexical knowledge, and whether the user
writes or delegates his/her query. The first step is to select the appropriate approach.

Step 1: Identify approach:

• Heuristic 1: Domain expertise:

Assume that no query has yet been submitted. The user manually indicates his/her
level of domain knowledge. Assume the user has high domain knowledge.

• Heuristic 2: Query writer:

No query has yet been submitted, so the user indicates whether s/he is the query
originator. This is a delegated query for ‘Mom,’ so it resides in Cell 2, Table 2.

• Heuristic 3: Language uncertainty:

A user requires a lexicon if he or she submits a term that has a synonym with
less than or equal to half the number of word senses. We first parse the query to
remove stop words (via the list at http://dvl.dtic.mil/stop_list.pdf) and action
words (e.g., set), and identify phrases via WordNet. The query becomes:
series “physical therapy” sessions biweekly appointments.

Querying each term against WordNet shows that every term has either no synonyms
or fewer word senses than its synonyms of the same sense. Thus, it can be inferred
that the user has high language certainty.

The result of this step is that the system emphasizes user profiles over ontologies
or lexicons to construct the query (see Table 2, cell 2).
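
Heuristic 3 can be stated as a one-line test on sense counts. The Python sketch below assumes the sense counts and synonym lists have already been looked up; the function name needs_lexicon and all the numbers are made up for illustration and are not WordNet data.

```python
def needs_lexicon(term, sense_counts, synonyms):
    """Heuristic 3: True if some synonym has at most half as many senses as the term."""
    own = sense_counts[term]
    return any(2 * sense_counts[s] <= own for s in synonyms.get(term, ()))


# Made-up sense counts and synonym lists, for illustration only (not WordNet data).
SENSES = {"bank": 10, "depository": 2, "series": 3, "serial": 4, "biweekly": 1}
SYNONYMS = {"bank": ["depository"], "series": ["serial"]}

for term in ("bank", "series", "biweekly"):
    print(term, needs_lexicon(term, SENSES, SYNONYMS))
```

With these invented counts only "bank" is flagged, because its synonym is far less ambiguous than the term itself; a term with no synonyms is never flagged.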

Step 2: Identify appropriate type of user profile: 

The appropriate type of profile is a personal profile because the query has been 
delegated from an expert user. 

Step 3: Identify terms from profile to increase query precision: 

The system queries the user’s personal profile to identify relevant constraints.
Assume ‘Mom’ had entered a profile with the preferences listed in Table 1. Figure 4
illustrates the resulting profile:

Frame: Global Constraint                  Frame: Physical Therapist

Slot Name: Language                       Slot Name: Treatment
Slot Value: English                       Value: Ultrasound

Slot Name: Domain to search               Slot Name: Provider
Slot Value: .com                          Value: PSS Injury

                                          Slot Name: Distance
                                          Value: Atlanta

                                          Slot Name: Service Level
                                          Value: High quality

                                          Slot Name: Appointment times
                                          Value: Morning



Fig. 4. Profile for ‘Mom’ in Berners-Lee et al. Query 



Given the query (series “physical therapy” sessions biweekly appointments), this
step searches the profiles for any frame matching a query term. Only ‘physical
therapy’ matches a frame in the profile, so the query becomes series “physical therapy”
ultrasound PSS injury Atlanta high-quality morning sessions biweekly appointments.
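
Step 3 can be sketched as follows. This is a minimal Python illustration, assuming that a frame is looked up by the query phrase it matches; the names expand_query, PROFILE and QUERY are hypothetical, and the slot values are those shown in Figure 4.

```python
def expand_query(terms, profile):
    """Insert the slot values of a matching frame directly after the matching term."""
    expanded = []
    for term in terms:
        expanded.append(term)
        key = term.strip('"').lower()          # frames are looked up case-insensitively
        expanded.extend(profile.get(key, []))  # no match: the term is kept unchanged
    return expanded


# Slot values of the matching frame in Figure 4, in the order listed there.
# Keying the frame by the phrase "physical therapy" is an assumption of this sketch.
PROFILE = {"physical therapy": ["ultrasound", "PSS injury", "Atlanta",
                                "high-quality", "morning"]}
QUERY = ["series", '"physical therapy"', "sessions", "biweekly", "appointments"]

print(" ".join(expand_query(QUERY, PROFILE)))
```

Run on the parsed query, the sketch reproduces the expanded query given in the text, with the five slot values placed directly after the matching phrase.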

Step 4: Execute query: 

The system executes the expanded query. A manual test on Google returns zero
results, so the following heuristic is invoked.

• Heuristic 4: Term reduction:

If fewer than ten pages are returned and they do not include a relevant result,
re-specify the query by removing constraints of lesser weight. Assume that users
list attributes of constructs in descending order of importance. The query is
iteratively re-run, each time after deleting the term listed last in the profile.
The query becomes:

“physical therapy” ultrasound PSS injury Atlanta

which returns two pages on Google:

- www.pssinjurycenter.com/PSSpatientForms.doc

- www.spawar.navy.mil/sti/publications/pubs/td/3138/td3138cond.pdf

After accounting for the global constraint (Domain to search: .com), the second
result is removed, leaving one page: www.pssinjurycenter.com/PSSpatientForms.doc.
This is a relevant page for the query because it is the user’s preferred provider.
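
The term-reduction loop of Heuristic 4 can be sketched in Python as below. The search engine is replaced by a toy in-memory function, and the names term_reduction, toy_search and CORPUS as well as the two example URLs are assumptions made for this illustration only.

```python
def term_reduction(base_terms, constraints, search, is_relevant, min_pages=10):
    """Re-run the query, dropping the last-listed constraint until results suffice."""
    remaining = list(constraints)  # assumed to be in descending order of importance
    while True:
        query = list(base_terms) + remaining
        pages = search(query)
        enough = len(pages) >= min_pages or any(is_relevant(p) for p in pages)
        if enough or not remaining:
            return query, pages
        remaining.pop()            # delete the least important constraint


# Toy stand-in for a search engine: a page matches if it contains every query term.
CORPUS = {
    "pss.example/forms": {"physical therapy", "ultrasound", "PSS injury", "Atlanta"},
    "other.example/report": {"physical therapy", "ultrasound", "PSS injury", "Atlanta"},
}


def toy_search(query):
    return [url for url, words in CORPUS.items() if set(query) <= words]


query, pages = term_reduction(
    ["physical therapy"],
    ["ultrasound", "PSS injury", "Atlanta", "high-quality", "morning"],
    toy_search,
    is_relevant=lambda url: url.startswith("pss."),
)
print(query, pages)
```

In this toy run the constraints "morning" and "high-quality" are dropped in turn, after which two pages are returned and one of them is judged relevant, so the loop stops.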

Step 5: Obtain feedback: 

The user indicates whether the query was a success or whether a new query is
needed.



4 Prototype 

A prototype of the profile module is being developed (Figure 5). Users specify
queries in natural language. Java application code parses queries, makes inferences,
adds context information, and constructs the search engine queries. The application
interfaces with repositories that store User Profile and History information, lexicons
(WordNet) and domain ontologies (DAML). The system interfaces with Google and
Alltheweb.

Fig. 5. Semantic Retrieval System Prototype Design

There are three modules. The Interface & Parse Module captures the user’s query,
parses it for nouns (noun phrases) and returns the part of speech for each term. From
this, a baseline query is created. The Local Knowledge Base & Inference Engine
interfaces with WordNet. Synonyms are first obtained for each noun in the initial
query. Based on the user’s selected synset, hypernyms and hyponyms for the selected
sense of the term are obtained from the lexical database and the DAML library. The
Local Knowledge Base & Inference Engine module is being implemented using the Java
Expert System Shell, Jess (http://herzberg.ca.sandia.gov/jess). When processing a
query, the local knowledge base is populated with domain knowledge and user
preferences, expressed as Jess facts. For example, the domain knowledge base is
instantiated with synonyms, hypernyms and hyponyms selected by the user. Jess rules
reason about the context and retrieve relevant preferences using the base terms and
synonyms, hypernyms and hyponyms. This contextual information is added to the
knowledge base to identify the user’s preferences; e.g., if the context is restaurants,
then the following rule will retrieve restaurant preferences.

Jess> (defrule get_restaurant_preferences "Get slots defined for food preferences"

    ; fire only while the system is identifying preferences
    (phase identifying preferences)

    ; bind the query-context fact so that it can be retracted below
    ?cntxt <- (query-context restaurant)

    (restaurant_preferences (food_type ?x) (cuisine ?y) (domicile ?z) (driving_radius ?r)) =>

    (retract ?cntxt)

    (printout t "Restaurant Preferences" crlf)

    (printout t "FoodType: " ?x " Cuisine: " ?y " Domicile: " ?z " Driving Radius: " ?r crlf)
    (assert (Restaurant preferences retrieved)))

The Query Constructor expands the query by adding synonyms, negative knowledge,
hypernyms and/or hyponyms, and personal preferences, using several heuristics.
The expanded query is submitted to the search engine and the results are returned.



4.2 Sample Scenario 

Assume a novice user enters the delegated query: “Find a restaurant for dinner.” After 
parsing the user’s input, the terms are displayed and the user can uncheck irrelevant 







V.C. Storey, V. Sugumaran, and A. Burton-Jones 



ones. The Inference Engine retrieves the word senses for each term, from which the
user identifies the appropriate sense. If none is selected, the module uses the query’s
base term; e.g., “dinner” has two senses. After selecting the appropriate word sense,
the user initiates query refinement. For each term, the hypernyms and hyponyms
corresponding to its word sense are retrieved from WordNet and the DAML ontologies.
The user can identify the hypernyms or hyponyms that best match the query’s
intention to be added; e.g., the user might select “cafe.” Additional contextual
information is generated and used to search the user profile for preferences. For
example, based on “cafe” and “meal”, the system infers that the user is interested in
having a meal in a restaurant and searches the user profile to retrieve preferences for
restaurants and food habits. The restaurant preference frame shows that the user likes
Indian vegetarian food and does not like to drive more than 30 miles. The user lives in
Atlanta. All preference information is added, but the user might not want to use some
of it (e.g., the user may relax the driving distance restriction). The inference engine
adds the additional information and the Query Constructor module generates the final
query. Based on the word sense for each term, a synonym may be added with an OR
condition. For example, “restaurant” is expanded with the first term in the user’s
selected synset (i.e., restaurant OR “eating house”). Negative knowledge is also added
from the next highest synset that has a synonym, to exclude unrelated hits. The
user-selected hypernyms and hyponyms are added, as is the user preference information.
For example, “Indian vegetarian” and Atlanta are added. The final refined query is
then submitted to the search engine (Figure 6). The refined query for the restaurant
example using Google’s syntax is:

(restaurant OR “eating house”) cafe “Indian vegetarian” Atlanta dinner meal

This query generates 79 hits. The user evaluates them by looking at the snippets or
sites. Based on the snippets shown in Figure 6, the hits appear to be very relevant.
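
The assembly of the final query string can be sketched as follows. This Python example is a hedged illustration, not the prototype's Java code: the function names quote and build_query are assumptions, and the only search syntax it relies on is the OR operator, phrase quoting and a leading minus sign for exclusion.

```python
def quote(term):
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def build_query(terms, synonyms=None, excluded=()):
    """Join terms into one query; synonyms become OR groups, exclusions get a minus."""
    synonyms = synonyms or {}
    parts = []
    for term in terms:
        alternatives = [term] + list(synonyms.get(term, []))
        if len(alternatives) > 1:
            parts.append("(" + " OR ".join(quote(a) for a in alternatives) + ")")
        else:
            parts.append(quote(term))
    parts.extend("-" + quote(term) for term in excluded)
    return " ".join(parts)


print(build_query(
    ["restaurant", "cafe", "Indian vegetarian", "Atlanta", "dinner", "meal"],
    synonyms={"restaurant": ["eating house"]},
))
```

Called with the terms of the sample scenario, the sketch yields the refined query shown above; negative knowledge would be passed through `excluded`.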




Fig. 6. Results from Google Search Engine



5 Testing

We manually simulated our approach to test whether it improves query relevance.
We chose the context where user profiles should help most: when a domain novice
delegates a query to another user (Table 2, cell 4). In this context, the system uses
ontologies, lexicons, the novice’s personal profile, and a stereotype profile. A
restaurant domain was chosen, and Figure 7 shows the user profiles and queries used
for testing. One researcher created a personal profile and five queries while
another created a stereotype profile for the restaurant domain. The stereotype profile
was created based on the categories, attributes, and preferences listed in an online
source of restaurant reviews (http://www.accessatlanta.com). Consistent with personal
construct theory, the novice and expert profiles differ in the number of frames and
the types of slots.



Personal (Novice) Profile:

Frame: Restaurant
Slots:                      Values:
- Maximum commute time      - Less than 30 minutes
- Food                      - Prefer: American, French, Canadian, Italian;
                              Like: Indian, Greek;
                              Dislike: Seafood, fish, ethnic food
- Dining time               - 7pm reservation
- Usual number in party     - Four
- Live music                - No live music

Stereotype (Expert) Profile:

Frame: Restaurant
Slots:                      Values:
- Rating                    - Yes
- Credit card               - Yes
- Parking                   - Yes

Frame: Fine Dining Restaurant
Slots:                      Values:
- Menu                      - Excellent
- Service                   - Excellent
- Dress Code                - Formal

Frame: Good Value Restaurant
Slots:                      Values:
- Menu                      - Good
- Service                   - Good
- Dress Code                - Smart Casual
- Price                     - Reasonable

Test Queries

1. Find a restaurant that serves Canadian food.
2. Find a restaurant that serves PEI Mussels.
3. Find a seafood restaurant that serves chicken.
4. Find a restaurant that has a take-out area.
5. Find a medium-expensive Italian restaurant in Dunwoody.

Fig. 7. Testing the query approach in the Restaurant domain 

Table 4 details the results. Together, the knowledge sources provided many 
terms. Clearly, some terms such as ‘30 minutes’ are less useful for a query. Our 
ongoing research is investigating whether inferences can be made about such 
statements to allow better query expansion. Nevertheless, by judiciously selecting 
terms from the list, we found that the relevance of each query could be improved, 
supporting the overall approach. 
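
The relevance counts reported in Table 4 can be summarised with a short calculation. The Python sketch below reads the "(Alltheweb)" figures as the unexpanded baseline and the "(our system)" figures as the expanded queries, which is an interpretation of the table rather than something the table states; the variable and function names are illustrative.

```python
# Relevant hits among the top 10 results for test queries 1-5 (values from Table 4).
BASELINE = [4, 2, 4, 2, 0]   # "(Alltheweb)" column entries, read as unexpanded queries
EXPANDED = [6, 5, 7, 5, 6]   # "(our system)" column entries


def mean_precision_at_10(hits):
    """Average fraction of relevant pages among the top 10, over all queries."""
    return sum(hits) / (10 * len(hits))


print(mean_precision_at_10(BASELINE))   # 0.24
print(mean_precision_at_10(EXPANDED))   # 0.58
print([e - b for b, e in zip(BASELINE, EXPANDED)])  # per-query gain: [2, 3, 3, 3, 6]
```

On these five queries the average share of relevant pages in the top ten rises from 0.24 to 0.58, and every individual query improves. With only five manually simulated queries this is an indication, not a statistical result.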



6 Conclusion 

A methodology for incorporating user profiles into query processing has been proposed
in which user profiles are represented as frames. The profile information is applied to
queries to improve the relevance of the query results obtained. The methodology is being
implemented in a prototype. Initial testing of the methodology shows promising results.
Further research is needed to formalize the methodology, complete the prototype, ensure
scalability, and validate the approach further. The techniques developed for capturing,
representing and using the user profile knowledge also need to be extended to
action queries.



Table 4. Summary of Testing Results



#   Parsed Terms          Additional Query Terms Presented to User From...                Relevance of Top 10*

1   Restaurant serves     Lexicon: “serve up,” provide                                    4 of 10 (Alltheweb);
    Canadian food         Ontologies: cafe, “French Canadian”                             6 of 10 (our system)
                          Profiles: less than 30 minutes, 7pm reservation, four
                          people, -“live music”, rating, credit card, parking

2   Restaurant serves     Lexicon: “serve up,” provide                                    2 of 10 (Alltheweb);
    PEI Mussels           Ontologies: cafe                                                5 of 10 (our system)
                          Profiles: less than 30 minutes, 7pm reservation, four
                          people, -“live music”, rating, credit card, parking

3   Seafood restaurant    Lexicon: “serve up,” provide, poulet, volaille                  4 of 10 (Alltheweb);
    serves chicken        Ontologies: poultry, “Fish Food”                                7 of 10 (our system)
                          Profiles: less than 30 minutes, (American or French or
                          Canadian), -Seafood, -fish, -“ethnic food”, 7pm
                          reservation, four people, -“live music”, rating, credit
                          card, parking

4   Restaurant takeout    Lexicon: None                                                   2 of 10 (Alltheweb);
    area                  Ontologies: cafe                                                5 of 10 (our system)
                          Profiles: less than 30 minutes, (American or French or
                          Canadian), -Seafood, -fish, -“ethnic food”, 7pm
                          reservation, four people, -“live music”, rating, credit
                          card, parking

5   Medium-expensive      Lexicon: average, moderate                                      0 of 10 (Alltheweb);
    Italian restaurant    Ontologies: “European country”                                  6 of 10 (our system)
    Dunwoody              Profiles: less than 30 minutes, 7pm reservation, -Seafood,
                          -fish, -“ethnic food”, four people, -“live music”, rating,
                          credit card, parking, good menu, good service, smart casual



* The queries were run on the search engine http://www.alltheweb.com. Other search engines
(e.g., Google) could not be used as they restrict the number of query terms (e.g., to ten terms).



Abstract. We present a system for automatically annotating text with
lexical semantic structures based on FrameNet, using a cascade of flat 
parsers with hand-crafted rules leading to a robust overall system. 

This translation process is eventually to be employed in a Question An- 
swering system to allow a more meaning-oriented search for answers: We 
want to explore the possibilities of a match of the FrameNet representa- 
tions of the document base with that of the user’s query. This will allow 
a more directed matching than the often-used approach of first retriev- 
ing candidate documents or passages with techniques based on surface 
words and will bring together ideas from Information Extraction, Ques- 
tion Answering and Lexical Semantics. FrameNet allows us to combine 
words in the text by ‘semantic valency’, based on the idea of thematic 
roles as a central aspect of meaning. This description level supports gen- 
eralisations, for example over different words of one word field. 

First results of the FrameNet translation process are promising. A large 
scale evaluation is in preparation. After that, we will concentrate on 
the implementation of the Question Answering system. A number of 
implementation issues are discussed here. 



1 Overview 

We present a system for automatically annotating German texts with lexical 
semantic structures, namely FrameNet, using a cascade of flat parsers. This 
process is eventually to form the core of a meaning-oriented Question Answering 
(QA) system where both the text corpus to be searched and the user’s questions 
to the system are automatically translated into FrameNet representations and 
matching is done directly on these structures. This is ongoing work within the 
Collate project: While the greater part of the FrameNet translation process has 
been implemented and is described here in more detail, the QA system is still 
in its design phase. We will give a sketch of some design issues. 

Most QA systems today use a different approach [1]: After processing the
user’s question, a document (or sometimes passage) search is done using keywords
from the question, often complemented by semantically related words
(query expansion). The actual search is thus done based on surface words, mostly
using indexing techniques. The retrieved candidate documents or passages are



F. Meziane and E. Metais (Eds.): NLDB 2004, LNCS 3136, pp. 64-75, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004

then processed either statistically [2] or with ‘deeper’ linguistic methods [3,4,5].
The answer is generated from the passage that was rated most relevant, either 
as a text snippet or (if a linguistic analysis was performed) using generation 
techniques on the analysis results. 

Using surface word-based Information Retrieval (IR) techniques to find candidate
documents has become state-of-the-art after several attempts to introduce ‘deep’
linguistic processing into QA in the 1970s more or less failed. The IR approach
generally leads to very efficient processing. However, these systems have a number
of potential disadvantages, such as imperfect precision and high reliance on
answer redundancy [6].

In contrast, matching structured representations has in the past mainly been 
employed in natural language database front-ends (for an overview see [7]). These 
front-ends allow a user to query structured databases in natural language, ideally 
without any knowledge about the design or internal working of the database. 
Since the front-ends are mostly tailored to the respective task, they can reach 
a high overall performance. On the other hand, they are more or less domain 
dependent and do not scale up well in general. In addition, they are fitted to one
specific database and cannot simply accommodate new, raw data (especially not
unstructured text).

Another related field of work is Information Extraction (IE). Here, text collections
are processed to fill pre-defined templates with specific information. This
normally takes the form of identifying interesting entities (persons, organisations
etc.), identifying relations between them (e.g., Employee_Of), and finally filling
general scenario templates (e.g., describing a launch of a satellite in detail).
Well-known tasks in IE have been defined for the Message Understanding Conferences
[8]. Both rule-based and machine learning approaches have been applied
to these tasks in the past. The templates defined for IE tasks tend to be specific
and detailed, making it necessary to customise at least the template filling
module for each domain. FASTUS is a prominent example of such a system [9].

We want to make use of the idea of pre-processing the text to enrich it
with more structure. By doing the annotation process ‘off-line’, i.e., at document
indexing time, the actual search can efficiently be done at retrieval time
over structured data. As the basis for the annotation format, we have chosen the
lexical semantic structures of FrameNet, using its concept of semantic valency
to abstract away from linguistic issues on the text surface, such as wording or
systematic differences such as active/passive. The idea is to use the translation
process and the resulting FrameNet structures as a representation of both the
texts that are searched and the user’s question to the system, and to match these
representations as part of a more meaning-oriented search mechanism: If we
know the exact relation between two entities in the text and can directly search
that relation, then we can use this knowledge to answer questions about them
in a more principled way than if we only observe that they co-occur in a text
(as with bag-of-words techniques).

Our goal thus lies between the pure bag-of-words-based search and the systems
for IE and natural language database front-ends described above, in that it
will be as domain-independent as possible on the one hand and will, on the other
hand, allow finding answers to questions in a more meaning-oriented way. Such a
result will bring together IE and QA. This is achieved by using FrameNet structures
that describe general natural language structures, and not templates that
are defined by an underlying database or by a specific IE task: Sometimes, even
human annotators seem to have had difficulties in filling IE scenario templates
from given texts [8].

We believe that our approach, which is linguistically grounded and thus more
focused on the language of the text collection than on abstract information
structures, can be more reliably derived automatically and still offers a reasonable
structured representation to be searched. Our approach has the additional
advantage that answer generation becomes comparatively easy: Through the
retrieval process, we have direct access to a structured representation of the
original text. This can, together with the answer type, be used to generate a
natural language answer.

This paper is organised as follows: First we give a short introduction to 
FrameNet and argue for the usefulness of a FrameNet annotation for linguistic 
tasks. We describe our system set-up for the derivation of FrameNet structures 
and give an example of the processing. Then we sketch some challenges that 
still must be solved in integrating this parser into a full-fledged QA system. 
Finally, we conclude by summarising the current state of the project and giving 
an outlook. 

2 FrameNet for Semantic Annotation 

FrameNet is a lexical database resource containing valency information and ab- 
stract predicates. English FrameNet is developed at the ICSI, Berkeley, CA 1 
[10,11]. Development of a German FrameNet is currently underway within the 
context of the SALSA project 2 [12]. Here, a large corpus of German newspaper 
texts is annotated with FrameNet structures in a boot-strapping fashion, begin- 
ning with manual annotation of sub-corpora, while techniques are developed for 
(semi-) automatic annotation that will eventually speed up annotation substan- 
tially. Thus, completing the FrameNet database with valency information and 
annotating the corpus proceed in an interleaved fashion. 

Intuitively, a frame in FrameNet is a schematic representation of a situation
involving participants, props etc. Words are grouped together in frames according
to semantic similarity, a number of frames forming a domain. For each frame,
a small number of frame elements is defined, characterising semantic arguments
belonging to this frame. For example, the English verbs buy and sell belong to
frames in the Commerce domain. Among others, a frame element Buyer is
defined for them. When annotating documents using FrameNet, a Buyer would
always be labeled as such, even though its syntactic relation to the verb buy
or sell may be different (e.g., subject vs. indirect object). A FrameNet
representation of ‘Peter sells the book to Mary for £2.’ is shown in Ex. 1.

1 framenet.icsi.berkeley.edu/~framenet/

2 The Saarbrücken Lexical Semantics Annotation and Analysis Project,
www.coli.uni-sb.de/lexicon/



Example 1. 



COMMERCE_SELL 
Seller: Peter 
Goods: book 
Buyer: Mary 
Money: £2 



FrameNet is used here to describe semantic roles and thus the relations be- 
tween participants and objects in a situation as predicate-argument-structures. 
Words that are semantically similar will receive comparable descriptions with 
the same role labels. This not only holds for synonyms, but also for antonyms 
and converse relations (such as buy and sell), and across parts of speech. 

This way of semantic grouping differs from ontologies like WordNet and
its German version, GermaNet [13], that mainly concentrate on notions of
hyponymy and hypernymy. They do not contain information on the semantic
valency of words, i.e., the relations described above. However, semantic relations
are central to capturing the meaning of texts, as argued above.

FrameNet also differs from other representations of semantic valency like
tectogrammatical structures [15] or PropBank [16]: In these approaches, the
buyer in an act of selling would, e.g., be described either as the Addressee or
the Actor by tectogrammatical structures, or as an Arg0 or Arg2 by PropBank,
depending on whether the words buy or sell are used (Addressee/Arg2 of the
selling, but Actor/Arg0 of the buying). In FrameNet, the roles are characterised
with respect to the frame; therefore the buyer is a Buyer, no matter which
words are used.
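
The effect of frame-relative roles can be illustrated with a toy valency lexicon. The Python sketch below is not the system described in this paper: the dictionary LEXICON, the function annotate, the syntactic labels and the frame name COMMERCE_BUY are assumptions chosen for illustration, with COMMERCE_SELL taken from Example 1.

```python
# Hypothetical valency lexicon: verb -> (frame, {syntactic function: frame element}).
LEXICON = {
    "sell": ("COMMERCE_SELL",
             {"subject": "Seller", "object": "Goods", "to": "Buyer", "for": "Money"}),
    "buy": ("COMMERCE_BUY",
            {"subject": "Buyer", "object": "Goods", "from": "Seller", "for": "Money"}),
}


def annotate(verb, arguments):
    """Relabel syntactic arguments with the frame elements defined for the verb."""
    frame, mapping = LEXICON[verb]
    return frame, {mapping[func]: filler for func, filler in arguments.items()}


sold = annotate("sell", {"subject": "Peter", "object": "book", "to": "Mary", "for": "£2"})
bought = annotate("buy", {"subject": "Mary", "object": "book", "from": "Peter", "for": "£2"})
print(sold)
print(bought)
print(sold[1] == bought[1])
```

Although Mary is the indirect object of sell and the subject of buy, both annotations label her Buyer, so the two role assignments compare equal.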

This representation is therefore especially suited for applications where the 
surface wording is less important than the contents. This is the case for Infor- 
mation Management systems: In IE and IR, especially in QA, it is far more 
important to extract or find the right contents; differences in wording are more 
often than not a hindrance in this process. 

For a QA system, we plan to use the FrameNet translation to annotate the 
text basis to be used for QA ‘off-line’. The FrameNet structures are then stored in 
a way that allows efficient access. In the actual query process, the user’s questions 
are again translated by the same FrameNet translation process, supported by a 
question type recognition module. 

Matching against FrameNet structures instead of just words in an index will,
e.g., make it possible to find an answer to the question ‘Who bought Mannesmann?’,
no matter if the relevant text passage originally reads ‘Vodafone took over
Mannesmann in 2000.’, ‘Mannesmann was sold to Vodafone.’ or ‘Vodafone’s purchase
of Mannesmann...’. This will not only increase recall (because the wording need
not be matched exactly), but also precision (because texts in which the words
bought and Mannesmann co-occur by chance do not match).

As we have described above, the off-line indexing process described here is 
in principle an IE task. However, as it is based on the semantic valency of words 
here and not on abstract IE templates and thus closer to the text itself, it is
less specialised and domain-dependent and expected to scale up better. We have 
started our work on a limited domain (business news texts). Exemplary tests 
with other texts (law, politics) have shown, however, that the approach is not 
limited by the domain. 



3 Deriving FrameNet Structures from Text 

In the previous section, we have argued for the usefulness of a text representation 
based on semantic valency. We have claimed that such a representation may be 
automatically derived from texts, as it is closely enough related to the text. 
We have implemented a system for deriving a FrameNet structure from German 
texts, based on flat parsing modules. Section 4 gives an example of the processing. 

Our system for annotating German texts with FrameNet structures uses a 
cascaded parsing process. Different parsers build on the results of each other 
to finally assign FrameNet structures to the text. These parsers focus only on 
recognising certain linguistically motivated structures in their respective input 
and do not try to achieve spanning parses. All parsers may leave ambiguities 
unresolved in the intermediate results, so that a later processing step may resolve 
them. This general technique was introduced under the name of easy-first-parsing 
in [17]. It generally leads to a more robust overall system. 

The input text is first tokenised and analysed morphologically. We employ 
the Gertwol two-level morphology system that is available from Lingsoft Oy, 
Helsinki. Gertwol covers the German morphology with inflection, derivation and 
compounding and has a large lexicon of approximately 350,000 word stems. 

Next is a topological parser. German sentences have a relatively rigid structure
of topological fields (Vorfeld, left sentence bracket, Mittelfeld, right sentence
bracket, Nachfeld) that helps to determine the sentence structure. By determining
the topological structure many potential problems and ambiguities can be
resolved: The overall structure of complex sentences with subclauses can be
recognised and the main verbs and verb clusters formed by modal verbs and
auxiliaries can be identified with high precision. The topological parser uses
a set of about 300 hand-crafted context-free rules for parsing. Evaluation has
shown that this approach can be applied successfully to different sorts of texts
with both recall and precision averaging 87% (perfect match) [18].

Named Entity Recognition (NER) is the next step in the processing chain. 
We use a method based on hand-crafted regular expressions. At the moment, 
we recognise company names and currency expressions, as well as some person 
names. Recognition of company names is supported by a gazetteer with several
thousand company names in different versions (such as ABB, ABB Asea
Brown Boveri and ABB Group). The NE grammars and the gazetteer have been
developed as part of a multi-lingual IE system within our project [19]. In an 
evaluation, the module has reached state-of-the-art results (precision of up to 
96 %, recall of up to 82 % for different text sorts). 

In the long run, we consider enhancing the NE recogniser by methods for
the automatic recognition of new NE patterns. As there are no out-of-the-box
NERs for German that we are aware of, and as systems based entirely on machine
learning techniques, such as the ones used in the multilingual named entity
recognition shared task at CoNLL 2003 [20], need a relatively large amount of
annotated training data, we think that a bootstrapping approach will prove the
best option. Several methods have been proposed for this. Among them, especially
the ‘learn-filter-apply-forget’ method [21] has been successfully applied for German.

Named entity recognition is followed by a chunker for noun phrases (NPs) 
and prepositional phrases (PPs). This chunker was originally built for grammar 
checking in German texts [22]. It uses extended finite state automata (FSA) for 
describing German NPs/PPs. Though simple German NPs have a structure that 
can be described with FSA, complex NPs/PPs show self-embedding of (in prin- 
ciple) arbitrary depth: Attributive adjectives, for example, can take arguments 
in German, leading to centre self-embedding. The extension has been added exactly
to accommodate these structures. This allows our system to analyze NPs
like [das [vom Konkurrenten]PP übernommene Unternehmen]NP (‘the by the
competitor taken-over company’, the company taken over by the competitor).

One important feature of our system is that the results of the NE recogniser
directly feed into the NP chunker. This allows us to handle an NE as an NP or
N', so that it can form part of a more complex NP. Among other phenomena,
modification with adjectives and coordination can thus be handled. One example
of such a complex NP: ein Konsortium aus [Siemens AG] und mehreren
Subunternehmern (a consortium of Siemens Ltd. and a number of sub-contractors).

We have evaluated our original NP chunker without NE recogniser by using 
1,000 sentences from the annotated and manually controlled NEGRA corpus 
[23] as a gold standard to evaluate the bracketing of NPs. Of the 3,600 NPs 
in this test corpus, our system identified 92 %. Most of the remaining 8 % were 
due to unrecognised named entities. Of the recognised NPs, the bracketing was 
identical with the NEGRA corpus in 71 % of the cases. This relatively low figure 
is caused mostly by the fact that post-nominal modifiers, in contrast to the 
NEGRA corpus, are not attached to the NP by our system (as this is done by 
later processing steps), accounting for about one third of the errors. 

The results of the previous stages are put together into one overall structure. 
We have called this structure PReDS (Partially Resolved Dependency Structure, 
[18]). PReDS is a syntacto-semantic dependency structure that retains a num- 
ber of syntactic features (like prepositions of PPs) while abstracting away over 
others (like active/passive). It is roughly comparable with Logical Form [5] and 
Tectogrammatical Structures [15]. PReDS is derived from the structures built 
so far using context-free rules. 

In a last step, the resulting PReDS structure is translated into FrameNet 
structures [24]. This translation uses weighted rules matching sub-trees in the 
PReDS. The rules can be automatically derived from a preliminary version of 
a FrameNet database containing valency information on an abstract level (e.g. 
using notions like deep subject to avoid different descriptions for active and 
passive on the one hand, but keeping prepositions as heads of PPs on the other
hand). The FrameNet coverage for German is as yet only exemplary, but will grow
with the increasing coverage of the German FrameNet lexicon.

One additional processing component has not yet been integrated into the 
system, namely an ontology providing a hierarchy of semantic sortal information 
that can be used for describing selectional preferences on frame elements. So far, 
we have found that the so-called ‘core frame elements’, that must necessarily be 
present, are in most cases well-identifiable in syntax as, for example, subjects 
or objects, whereas ‘non-core elements’ such as Time or Location can often 
not be recognised by syntactic features alone. For example, both Time and
Location can be given in a PP-in, as ‘in the last year’ (Time) or ‘in the capital’
(Location). Using sortal information (we plan to employ GermaNet) will help 
to resolve cases that so far must remain underspecified. 

So far, only walk-through evaluations for single sentences have been con- 
ducted for our system. In these, currently about three quarters of the sentences 
of a test corpus of business news receive a PReDS representation as a basis for the 
FrameNet annotation. For FrameNet frames, the core frame elements are mostly 
correctly assigned, whereas non-core elements are less successfully handled. 

A quantitative evaluation is planned to start shortly. We plan to focus on 
two ‘gold standard’ evaluations using annotated corpora. For the evaluation of 
the PReDS structures we are currently looking into the possibility of using as 
a gold standard the German TIGER newspaper corpus of 35,000 sentences that 
has a multi-level annotation including grammatical functions [25] . The evaluation 
would be based on the grammatical functions. A suitable scheme for dependency- 
oriented parser evaluation has been described in [26]. 

To evaluate the FrameNet annotation (i.e., end-to-end evaluation), we plan
to use the finished parts of the SALSA corpus of FrameNet for German [12] in a
similar way. Due to the still small coverage of the SALSA corpus, this may have
to be complemented by manual annotation and evaluation of a certain amount
of text. Based on the first walk-through evaluations, we would expect our system
to reach precision and recall values around the 60 % mark. This would be on par
with the results of related work, briefly described in the following.

For English FrameNet, a system for automatically annotating texts with 
FrameNet structures has been described [27]. It uses machine learning techniques 
and thus needs to be trained with a corpus that is annotated with FrameNet 
structures. The authors report a recall of 61% and a precision of 65% for the 
text-to-frame element labelling task. As this approach needs a large amount 
of training data (the authors have used the FrameNet corpus with 50,000 la- 
beled sentences), we could not easily transfer it to German, where the SALSA 
FrameNet corpus is only now under development. 

An approach partly similar to ours has been described for annotating Czech 
texts in the Prague Dependency Tree Bank with tectogrammatical (i.e., seman- 
tic) roles [28] . The author uses a combination of hand-crafted rules and machine 
learning to translate a manually annotated syntactic dependency structure into 
the tectogrammatical structures. He reports recall and precision values of up to 
100 % and 78 %, respectively, or, with different system settings, of 63 % and 93 %.
However, this approach does not start from text, and therefore only shows how
well the translation from dependency structures to semantic roles can perform.

4 One Example 

As an example we have chosen a sentence originally taken from our corpus of 
German business news texts (Ex. 2). The parse results for the different layers 
are shown in Fig. 1. 

Example 2.

Lockheed hat von Großbritannien den Auftrag für 25 Transportflugzeuge erhalten.
Lockheed has from Great Britain the order for 25 transport planes received.
‘Lockheed has received an order for 25 transport planes from Great Britain.’




Fig. 1. Example sentence with parse results (simplified from Süddeutsche Zeitung, 2
January 95)



In constructing the PReDS, the different tools contribute in the following
way: The topological parser is used to recognise the split verb hat ... erhalten.
The NE recogniser finds Lockheed, whereas the other NPs are detected by the
NP chunker. Note that in the construction of the PReDS all uncertain decisions
are postponed for later processing steps. For example, the PPs always receive
low attachment by default: The PP-für is thus attached to the preceding NP,
though syntactically a PP-für might also be a verb modifier, e.g., expressing a
beneficiary (like a free dative).

A number of FrameNet structures are constructed based on the PReDS. Two 
of them are shown in Exs. 3 and 4. These structures are derived from the PReDS 
by lexical rules containing different possible syntactic patterns for the targets. 

Example 3.

Getting
Target: erhalten
Donor: Großbritannien
Recipient: Lockheed
Theme: Auftrag


Example 4.

Request
Target: Auftrag
Message: 25 Transportflugzeuge



The underlying rule deriving the Getting frame is shown in Ex. 5. The 
valency information expressed in the rules allows the derivation process to repair 
incorrect default choices made during the PReDS construction. For example, a 
PP that received the standard low attachment might be changed from an NP 
modifier into a verb complement, given the correct verb valency. 



Example 5. 



Getting 

Target: erhalten-verb 
Recipient: Deep-Subject 
Theme: Deep-Object 
[Donor: PP-von] 
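
To make the rule format of Ex. 5 more concrete, the following minimal Python sketch applies such a rule to a strongly simplified dependency structure. The dictionary encoding of the PReDS, the function name apply_rule, and the reading of the bracketed Donor slot as optional are illustrative assumptions and not the data structures of the system described here; rule weights and the repair of attachment decisions are left out.

```python
# Hypothetical, simplified encoding of a PReDS node for Ex. 2:
# a head lemma with labelled dependents (an assumption for illustration).
preds = {
    "lemma": "erhalten",
    "pos": "verb",
    "deps": {
        "Deep-Subject": {"lemma": "Lockheed", "deps": {}},
        "Deep-Object": {"lemma": "Auftrag", "deps": {}},
        "PP-von": {"lemma": "Großbritannien", "deps": {}},
    },
}

# The rule of Ex. 5; the bracketed Donor slot is treated as optional here.
getting_rule = {
    "frame": "Getting",
    "target": ("erhalten", "verb"),
    "required": {"Recipient": "Deep-Subject", "Theme": "Deep-Object"},
    "optional": {"Donor": "PP-von"},
}


def apply_rule(rule, node):
    """Return a frame (as a dict) if the rule matches the node, else None."""
    if (node["lemma"], node.get("pos")) != rule["target"]:
        return None
    frame = {"Frame": rule["frame"], "Target": node["lemma"]}
    for role, slot in rule["required"].items():
        if slot not in node["deps"]:
            return None  # a required element cannot be filled
        frame[role] = node["deps"][slot]["lemma"]
    for role, slot in rule["optional"].items():
        if slot in node["deps"]:
            frame[role] = node["deps"][slot]["lemma"]
    return frame


frame = apply_rule(getting_rule, preds)
print(frame)
```

Under these assumptions the result corresponds to the Getting frame of Ex. 3, with Lockheed as Recipient, Auftrag as Theme and Großbritannien as Donor.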



The process described here can only derive FrameNet relations that are ex- 
pressed syntactically in some way (e.g., verb-object-relation). The next step to 
be implemented is the merging of information from different basic frames. In this 
example one would like to infer that the order is given to Lockheed by Great 
Britain by transferring the information from the Getting to the Request. 

Our experience so far has shown that there are relatively many cases like the 
one shown in the example where not all frame elements are part of the target 
word’s syntactic valency domain. In these cases, information must be transferred 
from other frames. This can partly be compared with template fusion in infor- 
mation extraction. Some of the cases are very systematic. We will investigate 
both hand-coding and learning these rules from the annotated corpora. 

5 Implementation Issues of the QA System 

In the previous sections, we have described how FrameNet representations are
derived from texts. We now turn to the question of how these structures may
actually be used in question answering. These issues are taken up in [29].

As both the document collection and the user’s question are translated into
FrameNet structures, the actual matching should in principle be straightforward:
In question processing, the ‘focus’ of the question (i.e., the word or concept that
the user asked for) would be replaced by a wild-card. Then the search would
be a lookup in the list of frames generated by the text annotation. Processing
the question to find the focus and expected answer type has proven to be an
important issue in QA systems [4,30]. We plan to use patterns over FrameNet
structures to match question types similar to the approaches described there.
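
The wild-card lookup outlined above can be illustrated with a small Python sketch. The flat dictionary representation of frames, the frame element name Goods, and the contents of the index are assumptions made for the example only; the planned system is not claimed to work this way.

```python
WILDCARD = "?"

# Hypothetical frame index, as the off-line annotation might produce it.
frame_index = [
    {"Frame": "Commerce_buy", "Buyer": "Vodafone", "Goods": "Mannesmann"},
    {"Frame": "Getting", "Recipient": "Lockheed", "Theme": "Auftrag",
     "Donor": "Großbritannien"},
]


def matches(query, frame):
    """A frame matches if every query role is present and every
    non-wild-card value is identical."""
    for role, value in query.items():
        if role not in frame:
            return False
        if value != WILDCARD and frame[role] != value:
            return False
    return True


def answer(query, index):
    """Return the fillers found at the wild-card (focus) positions."""
    focus_roles = [role for role, value in query.items() if value == WILDCARD]
    return [tuple(frame[role] for role in focus_roles)
            for frame in index if matches(query, frame)]


# 'Who bought Mannesmann?' with the focus replaced by a wild-card.
question = {"Frame": "Commerce_buy", "Buyer": WILDCARD, "Goods": "Mannesmann"}
print(answer(question, frame_index))
```

Because the comparison is made on frames and roles rather than on surface words, any passage that was annotated with the same frame yields the same answer, regardless of its original wording.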

For the actual search, we need to store the resulting FrameNet structures in
a way that allows efficient access, preferably in a relational database. This should
ideally support searching for hyponyms and hypernyms, allowing, e.g., for a
question that contains buy not only to match the Commerce_buy frame, but
also the more general Getting frame that contains words like obtain or acquire,
and vice versa. This information is present both in FrameNet and GermaNet.
Methods for mapping such matches onto database queries have been implemented
for XML structures [31]. This must be complemented by introducing
underspecified ‘pseudo-frames’ for words and concepts that are not yet covered
by FrameNet. In cases where a question is translated into more than one relation
(e.g., ‘What is the size of the order given to Lockheed by Great Britain?’, where
first the order must be identified and then its size must be found), this should
in general translate into a database join. However, there will probably be cases
where a directed inference step is required to bridge the gap between question
and text representation.
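
As an illustration of how a question involving more than one relation could translate into a database join, the following sketch stores the two frames of the Lockheed example in an in-memory SQLite database. The two-table schema, the column names and the join on the lemma string are assumptions made for this example; a real system would presumably join on token or referent identity rather than on lemmas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE frame   (id INTEGER PRIMARY KEY, name TEXT, target TEXT);
    CREATE TABLE element (frame_id INTEGER, role TEXT, filler TEXT);
""")
conn.executemany("INSERT INTO frame VALUES (?, ?, ?)",
                 [(1, "Getting", "erhalten"), (2, "Request", "Auftrag")])
conn.executemany("INSERT INTO element VALUES (?, ?, ?)", [
    (1, "Recipient", "Lockheed"),
    (1, "Donor", "Großbritannien"),
    (1, "Theme", "Auftrag"),
    (2, "Message", "25 Transportflugzeuge"),
])

# Step 1: find the Getting frame with the given Recipient and Donor.
# Step 2: join its Theme with the target of a Request frame and
#         return that frame's Message.
QUERY = """
SELECT msg.filler
FROM frame AS g
JOIN element AS rec ON rec.frame_id = g.id AND rec.role = 'Recipient'
JOIN element AS don ON don.frame_id = g.id AND don.role = 'Donor'
JOIN element AS th  ON th.frame_id  = g.id AND th.role  = 'Theme'
JOIN frame   AS r   ON r.name = 'Request' AND r.target = th.filler
JOIN element AS msg ON msg.frame_id = r.id AND msg.role = 'Message'
WHERE g.name = 'Getting' AND rec.filler = ? AND don.filler = ?
"""
rows = conn.execute(QUERY, ("Lockheed", "Großbritannien")).fetchall()
print(rows)
```

Storing one row per frame element keeps the schema independent of the individual frames, at the cost of one self-join per constrained role.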

Introducing a directed inferencing step in this process would help to find 
answers that would otherwise be lost (especially in cases of a mismatch in the 
granularity between question and text representation, as mentioned above). Ex- 
amples of directed inferences are discussed in [3] . We are currently investigating 
the possibility of automatic inferencing over the FrameNet structures similar to 
the approach taken there. 



6 Conclusions and Outlook 

We have presented a system for automatically adding FrameNet structures to 
German texts using a cascade of flat parsers. Distributing the task of parsing over 
different parsers each handling only one single linguistic layer helps to make our 
approach robust and scalable. We have argued that FrameNet annotation adds a 
level of information that is especially useful for tasks like IE/IR, as semantically 
related words belonging to the same frame receive a comparable representation. 
We plan to use our system in a QA system using direct matching of FrameNet 
representations of both the user’s question and the texts to be searched. After 
the system evaluation described in Sec. 3, we plan to implement the QA system 
described in Sec. 2. Several open questions have been presented in Sec. 5. 

Other envisaged uses include annotating Intranet/Internet pages with Frame- 
Net annotation and then translating the resulting structures into one of the 
formalisms associated with the Semantic Web. It has been shown for English 
FrameNet that this translation is relatively straightforward [32]. This is, however, 
constrained by annotation speed: At the moment, it takes a few seconds to 
annotate complex sentences. This will be sufficient to annotate collections of 
tens of thousands of documents (provided that the rate of document change is 
comparatively low). It will, however, not yet be suitable for a reasonably up-to-
date Internet service with millions of pages.

An important part of the development of the overall QA system described in
Sec. 2 will be an evaluation to see whether a FrameNet-based system can
measurably improve the performance of a QA system compared to one using
bag-of-words techniques. This is described in more detail in [33].

Abstract. Most natural language database interfaces suffer from the translation
knowledge portability problem, and are vulnerable to ill-formed questions be- 
cause of their deep analysis. To alleviate those problems, this paper proposes a 
lightweight approach to natural language interfaces, where translation knowl- 
edge is semi-automatically acquired and user questions are only syntactically 
analyzed. For the acquisition of translation knowledge, first, a target database is 
reverse-engineered into a physical database schema on which domain experts 
annotate linguistic descriptions to produce a pER (physically-derived Entity- 
Relationship) schema. Next, from the pER schema, initial translation knowl- 
edge is automatically extracted. Then, it is extended with synonyms from lexi- 
cal databases. In the stage of question answering, this semi-automatically con- 
structed translation knowledge is then used to resolve translation ambiguities. 



1 Introduction 

Natural language database interfaces (NLDBI) allow users to access database data in 
natural languages [2]. In a typical NLDBI system, a natural language question is
analyzed into an internal representation using linguistic knowledge. This internal 
representation is then translated into a formal database query by applying translation 
knowledge that associates linguistic constructions with target database structures. 
Finally, the database query is delivered to the underlying DBMS. 

Translation knowledge, which is created for a new target database, is necessarily 
combined with linguistic knowledge. Previous works are classified according to the 
extent that these two types of knowledge are connected: monolithic, tightly coupled, 
and loosely coupled approaches. In monolithic approaches, anticipated question pat- 
terns are associated with database query patterns. Question patterns are defined at the 
lexical level, so they may over-generalize the question meaning. Tightly coupled 
approaches hard-wire translation knowledge into linguistic knowledge in the form of 
semantic grammars [6,13,12]. These two approaches require modification of the two 
kinds of knowledge in porting to a new domain. So, NLDBI researchers have
concentrated on minimizing the connection of the two types of knowledge in order to
improve transportability.

Loosely coupled approaches entail further division. Syntax-oriented systems
[3,4,8] analyze questions up to a syntactic level. Logical form systems [14,5,1]
interpret a user question into a domain-independent literal meaning level. In these
approaches, translation knowledge is applied after analysis. Thus, porting to a new
database domain does not require changing linguistic knowledge at all; only the
translation knowledge must be tailored to the new domain. Even in this case,
however, it is nontrivial to describe translation knowledge. For example,
syntax-oriented systems have to devise conversion rules that transform parse trees
into database query expressions [2], and logical form systems should define
database relations for logical predicates. In addition, creating such translation
knowledge demands considerable human expertise in AI/NLP/DBMS and domain
specialties.

Moreover, in state-of-the-art logical form approaches, a question should
successfully undergo tokenization, tagging, parsing, and semantic analysis. However,
this entire analysis pipeline is vulnerable to ill-formed sentences that are likely
to occur in an interactive environment, where users are prone to type their
requests tersely.

In order to automate translation knowledge acquisition, and to make a robust 
NLDBI system, this paper proposes a lightweight NLDBI approach, which is a more 
portable approach because its translation knowledge does not incorporate any lin- 
guistic knowledge except words. Note that, in loosely coupled approaches, translation 
knowledge encodes some linguistic knowledge, such as parse trees, or logical predi- 
cates. Our lightweight approach features semi-automatic acquisition and scalable 
expansion of translation knowledge, use of only syntactic analyzers, and incorpora- 
tion of information retrieval (IR) techniques for conversion of question nouns to do- 
main objects. 

The remainder of this paper is organized as follows. The next section defines the
terminologies used in this paper. Section 3 describes translation difficulties in
NLDBI, and Sections 4 through 6 explain a lightweight NLDBI approach using
examples. Discussions and concluding remarks are given in Section 7.



2 Terminologies 

In this paper, a domain class refers to a table or a column in a database. A domain
class instance indicates an individual column value. For example, suppose that a
database table T_customer has a column C_name. Then, T_customer and C_name are
called domain classes. If C_name has ‘Abraham Lincoln’ as its value, this value is
called a domain class instance. A class term is defined as a lexical term that refers to
a domain class. A value term means a term that indicates a domain class instance.
For instance, the word ‘customer’ in a user question is a class term corresponding to
the above domain class T_customer. The word ‘Lincoln’ is a value term referring to
the above domain class instance ‘Abraham Lincoln’.

3 Translation Difficulties in NLDBI

When translating a question noun into its correct domain class, main difficulties are 
due to paraphrases and translation ambiguities, which correspond to M-to-1 and 1-to- 
N mappings between linguistic expressions and domain classes (or domain class in- 
stances), respectively. For example, several different paraphrases (e.g., professor, 
instructor, a person who teaches, etc.) may refer to the same domain class, and the
same linguistic expression (e.g., year) may refer to many different domain classes, 
such as entrance year, grade year, etc. 

In NLDBI, paraphrases can occur as class terms or value terms. Class term
paraphrases correspond to synonyms or hypernyms. Value term paraphrases tend to occur
in abbreviations, acronyms, or even substrings. For example, MS, JFK, and Boeing
can refer to ‘Microsoft’, ‘John F. Kennedy airport’, and ‘Boeing xxx-xxx’,
respectively. So we should support partial matching between question value terms and
domain class instances. In other words, the translation module of an NLDBI system
needs to secure all possible paraphrases of each domain class or domain class
instance. In order to handle this paraphrase problem, our lightweight approach employs
both direct and indirect methods, described in Sections 5.2 and 6.2.1, respectively.

There are two types of translation ambiguities. Class term ambiguity occurs when 
a class term in a question refers to two or more domain classes. This ambiguity 
mostly results from general attributes that several domain entities share. Value term 
ambiguity occurs when a value term corresponds to two or more domain class in- 
stances. Date or numeric expressions may almost always cause value term ambiguity. 
This paper attempts to resolve translation ambiguities using selection restrictions, 
such as valency information and collocations, in Section 6.2.2. 

Fig. 1. Lightweight natural language database interfaces (NLDBI) architecture

4 Lightweight NLDBI Architecture

Fig. 1 shows the major steps of a lightweight NLDBI architecture. Currently, this
architecture is for relational databases, and uses SQL as a formal database query lan- 
guage. There are two processes: domain adaptation and question answering. For a 
new database domain, domain adaptation semi-automatically constructs translation 
knowledge for the subsequent question answering process. 

During domain adaptation, initial translation knowledge is first automatically
created from both natural language descriptions manually annotated on a physical
database schema and entire database values. Then, the domain adaptation process
continues to automatically adapt the initial translation knowledge to a target
database domain using domain-dependent texts, dictionaries, and corpora.

In a question answering process, a user’s natural language question is first syntac- 
tically analyzed into a set of nouns and a set of predicate-argument pairs. Question 
analysis also yields a set of feature-value pairs for each question noun. Next, for each 
question noun, noun translation finds its corresponding domain class, which is then 
marked on a physical database schema (Physical ER schema in Fig. 1). Finally, from 
the marked domain classes and question feature-value pairs, a formal database query 
is generated. 

Domain adaptation and question answering are exemplified using English
sentences in Sections 5 and 6, respectively, although the lightweight approach was
originally developed for Korean. Some language-dependent parts are therefore omitted.



5 Domain Adaptation 

5.1 Initial Translation Knowledge 

For example, consider creating an NLDBI system for a course database in Table 1. First, 
a database modeling tool is used to generate a physical database schema in Fig. 2. 



Table 1. Example of tuples for a course database

Physical Database Schema       Examples of Tuples
T1 (T1C1, T1C2, T1C3)          (1999-0011, Richard, 1999), (2001-0027, Tom, 2001)
T2 (T2C1, T2C2, T2C3)          (ST201A, Statistics, Richard), (ST310B, Algorithms, Joan)
T3 (T3C1, T3C2, T3C3, T3C4)    (1999-0011, ST201A, 1999, A), (2001-0027, ST201A, 2003, C)




Fig. 2. Physical database schema for a course database 

This can be automatically done through a reverse-engineering process that is
normally supported by most modern commercial database modeling tools. In Fig. 2, a
rectangle indicates a table, a circle refers to a column, and an arc represents either a 
relationship between tables, or a relationship between a table and its column. 

Next, domain experts provide natural language descriptions of a physical database 
schema according to the following input guidelines. The input process is also graphi- 
cally supported by database modeling tools. 



Input Guidelines

(1) For each domain class, input its linguistic names in the form of a noun phrase.
(2) For each domain class, input typical domain sentences including related domain
    classes in the form of a simple sentence.
(3) In describing domain sentences of (2), a noun phrase referring to a domain class
    should be either its linguistic name defined in (1), or a domain class itself.



Table 2 shows an example of natural language descriptions for the physical data- 
base schema in Fig. 2. The reason why we need linguistic annotations is that, on the 
database side, there are no linguistically homogeneous counterparts that can be re- 
lated to linguistic terms or concepts used in natural language questions. 

Table 2. Natural language descriptions for the physical database schema in Fig. 2

Domain Class   Linguistic Name
T1             Student
T1C1           Student identification number
T1C2           Student name
T1C3           Entrance year
T2             Course
T2C1           Course number
T2C2           Course name
T2C3           Instructor, professor
T3             Grade
T3C1           Student identification number
T3C2           Course number
T3C3           Grade year
T3C4           Grade

Domain Sentences
Students take courses in T3C3
Students get grades in T3C3
Students enter a school in T1C3
Instructors teach courses
Instructors give grades
Courses are open in T3C3





A physical database schema annotated with natural language descriptions is called 
a pER (physically-derived Entity-Relationship) schema. The term pER schema was 
coined because we believe that it is an approximation of the target database’s original 
ER (Entity-Relationship) schema, in the sense that each component in pER schema 
has a conceptual-level label (natural language description), and its structures are di- 
rectly derived from a physical database schema. Thus, pER schema has the potential 
to bridge between linguistic constructions and physical database structures. 

From the pER schema, translation knowledge is created. We divide translation
knowledge into two structures: class-referring and class-constraining. Class-referring
translation knowledge falls into two kinds: class documents and value documents. A
class document contains a set of class terms referring to a domain class. Class terms
are extracted from linguistic names for domain classes within the pER schema, by
generating head-preserving noun phrases, as shown in Table 3. A value document is
created from a set of domain class instances associated with a column.
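
The generation of head-preserving noun phrases can be sketched as follows. The sketch assumes that the head is the last word of an English linguistic name and that every word-level suffix of the name is kept; both the function names and this simplification are assumptions for illustration, since the original approach was developed for Korean.

```python
def head_preserving_phrases(linguistic_name):
    """Return all word-level suffixes of a noun phrase, so that every
    generated phrase still ends in the head noun (assumed to be last)."""
    words = linguistic_name.split()
    return [" ".join(words[i:]) for i in range(len(words))]


def class_document(linguistic_names):
    """Build the class terms for a domain class from a comma-separated
    list of linguistic names, e.g. 'Instructor, professor'."""
    terms = []
    for name in linguistic_names.split(","):
        for phrase in head_preserving_phrases(name.strip()):
            if phrase not in terms:  # keep each class term only once
                terms.append(phrase)
    return terms


print(class_document("Student identification number"))
print(class_document("Instructor, professor"))
```

For ‘Student identification number’ this yields the three class terms ‘Student identification number’, ‘identification number’ and ‘number’.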

As class-constraining translation knowledge, there are two types of selection
restrictions: valency-based and collocation-based. In this paper, a valency-based
selection restriction describes the set of domain classes of the arguments that
verbs or prepositions take, as shown in Table 4. A collocation-based one is simply
a collection of collocations of each domain class in the form of collocation
documents, as shown in Table 5. Therefore, given n tables and m columns per table,
n+3nm documents are created. In other words, for each column, three documents are
created: class, value, and collocation documents.
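
The document count can be checked with a few lines of Python. The reading that the remaining n term corresponds to one document per table, and the generalisation to tables with differing numbers of columns, are assumptions; the function name is hypothetical.

```python
def document_count(columns_per_table):
    """Number of documents for a schema, given the column count of each
    table: one document per table (assumed) plus three per column."""
    n = len(columns_per_table)
    return n + 3 * sum(columns_per_table)


# Uniform case from the text: n tables with m columns each gives n + 3nm.
n, m = 2, 4
print(document_count([m] * n), n + 3 * n * m)

# Course database of Table 1: T1 and T2 have 3 columns, T3 has 4.
print(document_count([3, 3, 4]))
```

With equal column counts the function reduces to the formula n+3nm given above.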

Then, in our case, except for valency-based selection restriction, constructing 
translation knowledge corresponds to indexing documents related to domain classes. 
Also, translating a question noun into its domain classes means retrieving relevant 
documents using a question noun as an IR query. This will be explained in Section 6. 

Tables 3, 4, and 5 show an example of translation knowledge associated with Tables
1 and 2. Class documents and selection restrictions can be automatically extracted,
because linguistic descriptions of the pER schema are subject to input guidelines that
restrict natural language expressions to simple sentences and a closed set of
linguistic names and domain classes. Value documents are created using value patterns
that are automatically determined by pattern-based n-gram indexing [7].



Table 3. Class-referring translation knowledge

Domain   Class-Referring Translation Knowledge
Class    Class Document                                        Value Document
-------  ----------------------------------------------------  ------------------------------------
T1       Student                                               NULL
T1C1     Student identification number, identification         n4s1n4, n4, s1, n4s1, s1n4
         number, number
T1C2     Student name, name                                    Richard, Tom
T1C3     Entrance year, year                                   1999, 2003
T2       Course                                                NULL
T2C1     Course number, number                                 c2n3c1, c2, n3, c1, c2n3, n3c1
T2C2     Course name, name                                     Statistics, Algorithms
T3C4     Grade                                                 A, C



In Table 3, n4s1n4 is a value pattern of the domain class T1C1. It means that values
of T1C1 (e.g., 1999-0011 or 2001-0027) typically start with 4 decimal digits (n4),
followed by 1 special character (s1), and end with 4 decimal digits (n4). The value
pattern c2n3c1 of T2C1 can be explained similarly. In this paper, patterns such as
n4s1n4 and c2n3c1 are called canonical patterns. This pattern-based value indexing [4]
reduces an open-ended set of alphanumeric terms into a closed set of patterns.
However, the method [4] cannot deal with random alphanumeric value terms that are not
captured by predefined patterns, and supports only exact matching.

To overcome the limitations of pattern-based value indexing [4], we further generate
1-grams and 2-grams in a pattern unit from canonical patterns. For example, n4s1n4
generates n4 and s1 as 1-grams, and n4s1 and s1n4 as 2-grams. These pattern-based
n-grams [7] enable partial matching between patterns, and may also reduce index
storage compared with storing canonical patterns, since canonical patterns are sliced
into smaller n-grams, many of which are duplicates. For example, given the question
"Show me courses starting with ‘st’", we can map ‘st’ into T2C1 through the n-gram
pattern c2 in the value document of T2C1. These pattern-based n-grams [7] are
one way to handle value term paraphrases referred to in Section 3.
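
The pattern-based n-gram idea can be sketched as follows. This is an illustration under stated assumptions rather than the indexing method of [7]: we assume that a pattern unit is a maximal run of digits (n), letters (c), or other characters (s) followed by its length, and the sample course number "CS101A" is hypothetical.

```python
import re

# One alternative per unit type; the group name becomes the unit prefix.
UNIT = re.compile(r"(?P<n>[0-9]+)|(?P<c>[A-Za-z]+)|(?P<s>[^0-9A-Za-z]+)")


def pattern_units(value: str) -> list[str]:
    """Split a value into pattern units, e.g. '1999-0011' -> n4, s1, n4."""
    return [f"{m.lastgroup}{len(m.group())}" for m in UNIT.finditer(value)]


def value_document(value: str) -> list[str]:
    """Canonical pattern plus its 1-grams and 2-grams over pattern units."""
    units = pattern_units(value)
    terms = ["".join(units)]  # canonical pattern
    for n in (1, 2):
        for i in range(len(units) - n + 1):
            terms.append("".join(units[i:i + n]))
    return list(dict.fromkeys(terms))  # drop duplicates, keep order


print(value_document("1999-0011"))
print(value_document("CS101A"))  # hypothetical course number
```

A partial value such as 'st' reduces to the single unit c2, so it can be matched against any value document that contains c2 as a 1-gram.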



Table 4. Class-constraining translation knowledge (valency-based selection restriction)

Predicate or Preposition   Set of Domain Classes
-------------------------  ----------------------
Take                       T1, T2, T3C3
Get                        T1, T3, T3C4, T3C3
Enter                      T1, T1C3
Teach                      T2C3, T2
Give                       T2C3, T3, T3C4
Open                       T2, T3C3
In                         T1C3, T3C3



In Table 4, a set of domain classes is obtained by merging domain classes of ar- 
guments that are governed by the same predicate or preposition in domain sentences 
of Table 2. 



Table 5. Class-constraining translation knowledge (collocation-based selection restriction)

Domain Class   Collocation Document
-------------  -----------------------
T1             NULL
T1C1           Student, identification
T1C2           Student
T1C3           Student, Entrance
T2             NULL
T2C1           Course



In Table 5, collocation words of each domain class are acquired from its linguistic
names in the pER schema, by gathering all the other words except the rightmost head
of each linguistic name. In addition, for each domain class corresponding to a
column, its collocation document additionally includes all terms in the class
document of the table to which that column belongs. For example, in the collocation
document for T1C3, ‘student’ was inserted from the class document of T1.
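
The construction of a collocation document can be sketched as below. The function name and the case-insensitive de-duplication are assumptions made for illustration; the linguistic names and table class terms are those of Tables 3 and 5.

```python
def collocation_document(linguistic_names, table_class_terms):
    """Collect non-head words of each linguistic name, then add the class
    terms of the owning table (assumed to be given as a list of strings)."""
    words = []
    for name in linguistic_names:
        words.extend(name.split()[:-1])  # everything except the rightmost head
    words.extend(table_class_terms)
    unique = {}
    for word in words:
        unique.setdefault(word.lower(), word)  # case-insensitive de-duplication
    return list(unique.values())


print(collocation_document(["Student identification number"], ["Student"]))
print(collocation_document(["Entrance year"], ["Student"]))
```

For T1C3 the word "Entrance" comes from the linguistic name "Entrance year" and "Student" comes from the class document of T1, which agrees with the row for T1C3 in Table 5.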



5.2 Expansion of Initial Translation Knowledge 

To directly attack class term paraphrases, the initial translation knowledge may be
further extended by extracting domain knowledge from domain materials, dictionaries,
or corpora. Domain materials are difficult to obtain electronically, so our method
currently extracts domain knowledge (domain-dependent translation knowledge)
from dictionaries and corpora. In this paper, we describe translation knowledge
expansion using only dictionaries.

5.2.1 Dictionary 

Table 6 shows an example of an extended class document for a domain class T2C3
from WordNet [10]. Extension procedures using WordNet are as follows. First, for
each class term in an initial class document, we obtain its genus terms from its
definitions in WordNet. For example, person italicized in Table 6 is a genus term of
the class term instructor. Then, using the WordNet taxonomic hierarchy, all hypernyms
of instructor are gathered until the genus term person or a taxonomic root is met. The
extracted hypernyms are inserted into the initial class document of T2C3. In the case
of professor, its hypernyms are similarly extracted using the genus term someone.
Superscripts in Table 6 indicate the hypernym level of each term relative to its
initial class term in the taxonomic hierarchy. The superscript is used to discriminate
class terms in terms of their class-referring strength, since a larger superscript
means a more generalized class term. Superscript 0 marks synonyms that are obtained in
WordNet from the synset node of each initial class term.



Table 6. Example of an extended class document

Domain   Class Document
Class    Initial      Extended
-------  -----------  ----------------------------------------------------------------
T2C3     Instructor   Instructor, teacher^0, educator^1, pedagogue^1, professional^2,
                      professional_person^2, adult^3, grownup^3, person^4,
                      individual^4, someone^4, somebody^4, mortal^4, human^4, soul^4
         Professor    Professor, academician^1, academic^1, faculty_member^1,
                      educator^2, pedagogue^2, professional^3, professional_person^3,
                      adult^4, grownup^4, person^5, individual^5, someone^5,
                      somebody^5, mortal^5, human^5, soul^5
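
The hypernym-gathering step can be sketched with a toy taxonomy. The dictionary below is a hand-made stand-in for the WordNet hierarchy (it keeps one term per level and ignores synonyms), and the function name is hypothetical; only the stopping rule and the level numbering follow the description of Table 6.

```python
# Toy single-parent taxonomy, assumed for illustration; not real WordNet data.
HYPERNYM = {
    "instructor": "educator",
    "educator": "professional",
    "professional": "adult",
    "adult": "person",
    "person": "organism",
}


def expand_class_term(term, genus_term, hypernym=HYPERNYM):
    """Walk up the taxonomy from term, recording (hypernym, level) pairs,
    until the genus term or a taxonomic root is reached."""
    expanded = []
    current, level = term, 0
    while current in hypernym:
        current = hypernym[current]
        level += 1
        expanded.append((current, level))
        if current == genus_term:
            break
    return expanded


print(expand_class_term("instructor", "person"))
```

The level stored with each term plays the role of the superscript in Table 6, so that more general terms can be given less weight as class references.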



However, an initial class term may have several senses in dictionaries. So, we need
a word sense disambiguation (WSD) technique to avoid including noisy hypernyms from
incorrect senses. For this, we employ a context vector approach [11]. First, a target
class term t is converted into a context vector V_t that is the set of all words
except t in the entire natural language descriptions of the pER schema. Next, for each
dictionary sense s of t, we create a context vector V_s that is the set of all words
except t in the definition sentence of s. Then, we disambiguate t based on Formula 1,
which selects the sense whose vector has the highest cosine similarity with V_t.

$$ s^{*} = \arg\max_{s} \frac{V_t \cdot V_s}{|V_t| \times |V_s|} \qquad (1) $$
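
A minimal sketch of this disambiguation step is given below, assuming that context vectors are binary (sets of words), so that the dot product is the size of the intersection. The sense labels and word sets are invented for illustration and are not taken from WordNet or from the pER schema.

```python
import math


def cosine(a: set, b: set) -> float:
    """Cosine similarity of two binary word vectors represented as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / (math.sqrt(len(a)) * math.sqrt(len(b)))


def best_sense(context: set, sense_definitions: dict) -> str:
    """Pick the sense whose definition words are closest to the context."""
    return max(sense_definitions, key=lambda s: cosine(context, sense_definitions[s]))


# Hypothetical data for the class term "course".
schema_context = {"student", "take", "number", "name", "instructor", "teach"}
senses = {
    "education": {"series", "lessons", "student", "instructor"},
    "meal": {"part", "meal", "served"},
}
print(best_sense(schema_context, senses))
```

Only hypernyms of the selected sense would then be added to the class document.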

Note that if a class document D corresponding to a table is changed through this 
domain knowledge extraction, a collocation document for each column that belongs 
to the table is merged with D to become a larger one. 

6 Question Answering

6.1 Question Analysis 

The question answering proceeds as follows. A user writes his or her information 
need in a natural language. Next, the user question is analyzed to produce a set of 
question nouns and a set of predicate-argument pairs. In Korean, these can be ob- 
tained after morphological analysis, tagging, chunking, and dependency parsing. 
Question analysis also yields a set of feature-value pairs for each question noun. 
Among others, important question features are a question focus and a value operator, 
because these two features determine select-clause items and where-clause operators, 
respectively, in a final SQL query. 

For example, consider a user question “show me the name of students who got A in
statistics from 1999”. After question analysis, we obtain the first four columns in
Table 7. Actually, ‘show’ is the governor of ‘name’, but we use ‘show’ only to
determine which question nouns are question foci.



Table 7. Analysis of an example question

                         Question Features
Question     Head   Question   Value      Relevant Domain    Disambiguated
Noun         Verb   Focus      Operator   Classes            Domain Classes
-----------  -----  ---------  ---------  -----------------  --------------
Name         -      Yes        -          T1C2^c, T2C2^c     T1C2^c
Student      Get    No         =          T1^c               T1^c
A            Get    No         =          T3C4^v             T3C4^v
Statistics   Get    No         =          T2C2^v             T2C2^v
1999         Get    No         >=         T1C3^v, T3C3^v     T3C3^v



6.2 Noun Translation 

Noun translation utilizes an IR framework to translate each question noun into a 
probable domain class. First, class retrieval converts each question noun into an IR 
query, and retrieves relevant documents from class-referring translation knowledge; 
that is, a collection of entire class documents and value documents. Here, retrieved 
documents mean the candidate domain classes for the question noun, because each 
document is associated with a domain class. Next, class disambiguation selects a 
likely domain class among the candidate domain classes retrieved by class retrieval, 
using class-constraining translation knowledge. 

6.2.1 Class Retrieval 

For each question noun, class retrieval retrieves lexically or semantically equivalent
domain classes in the form of class or value documents using an IR framework,
where the question noun is transformed into an IR query. In Table 7, superscript c or
v means that the associated domain class is obtained by retrieving class documents
or value documents, respectively. For example, ‘A’ was mapped into ‘T3C4^v’ by
retrieving a value document of T3C4, and ‘name’ was mapped into ‘T1C2^c’ or
‘T2C2^c’ by retrieving a class document of T1C2 or T2C2. This distinction between
class and value documents is important, because if a question noun retrieves a value
document, it needs to participate in an SQL-where condition.
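
Class retrieval can be sketched as a lookup over two small document collections. The sketch uses exact term matching in place of a real IR engine, and the value document for T3C3 (grade years) is an assumption, since Table 3 does not list it; the other entries are copied from Table 3.

```python
CLASS_DOCS = {
    "T1": {"student"},
    "T1C1": {"student identification number", "identification number", "number"},
    "T1C2": {"student name", "name"},
    "T1C3": {"entrance year", "year"},
    "T2": {"course"},
    "T2C1": {"course number", "number"},
    "T2C2": {"course name", "name"},
    "T3C4": {"grade"},
}
VALUE_DOCS = {
    "T1C2": {"richard", "tom"},
    "T1C3": {"1999", "2003"},
    "T2C2": {"statistics", "algorithms"},
    "T3C3": {"1999", "2003"},  # assumed grade-year values, not in Table 3
    "T3C4": {"a", "c"},
}


def retrieve(question_noun: str) -> list[tuple[str, str]]:
    """Return (domain class, 'c' or 'v') pairs whose document contains the noun."""
    query = question_noun.lower()
    hits = [(cls, "c") for cls, doc in CLASS_DOCS.items() if query in doc]
    hits += [(cls, "v") for cls, doc in VALUE_DOCS.items() if query in doc]
    return hits


for noun in ["name", "student", "A", "Statistics", "1999"]:
    print(noun, retrieve(noun))
```

The 'c' or 'v' tag corresponds to the superscripts in Table 7 and tells the later stages whether the noun must contribute a where-clause condition.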

The reason why an IR method is employed is twofold. First, the translation knowledge
construction process is simplified to document indexing. Thus, translation
knowledge is easily scalable to a larger size. Until now, translation knowledge
construction has required human expertise in areas such as AI, NLP, and DBMS, so the
construction process has been time-consuming and tedious. For example, in modern
logical form systems, a logical predicate should be defined for each lexical item,
then a database relation is to be defined for the logical predicate.

Second, IR can support a variety of matching schemes (e.g., lexical or conceptual 
matching, exact or approximate matching, and word or phrase matching), which are 
required to relate various forms of paraphrases for domain classes or domain class 
instances. These various IR matching techniques correspond to indirect methods in 
order to handle class term or value term paraphrase problems. 

6.2.2 Class Disambiguation 

After class retrieval, two types of translation ambiguities introduced in Section 3 may
occur. In our example question, class term ambiguity occurs on the question noun
‘name’, and value term ambiguity on ‘1999’. Then, to each question noun holding
two or more relevant domain classes, the selection restrictions in Tables 4 and 5 are
applied to determine the correct domain classes.

In order to resolve the value term ambiguity of ‘1999’, ‘get’ (the governor of ‘1999’)
is lexically or conceptually searched over the valency-based selection restrictions in
Table 4 to find that the domain verb ‘get’ requires its arguments to be B={T1, T3,
T3C4, T3C3}. Let A be the set of relevant domain classes of ‘1999’ (A={T1C3, T3C3}).
Then, A ∩ B is calculated to disambiguate ‘1999’ into the correct domain class T3C3.

However, valency-based selection restrictions in Table 4 cannot be applied to a
translation ambiguity whose target word does not have a predicate or preposition as
its governor, like ‘name’. In this case, collocation-based selection restrictions in
Table 5 are used as follows. To disambiguate ‘name’, the adjacent word ‘student’ of
the target word ‘name’ is used as an IR query to retrieve relevant collocation
documents B={T1C1, T1C2, T1C3}. Note that the elements in B are not class documents
but collocation documents. Let A be the set of relevant domain classes of ‘name’
(A={T1C2, T2C2}). Then, similarly, A ∩ B is calculated to disambiguate ‘name’ into the
correct domain class T1C2.
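
Both disambiguation cases reduce to a set intersection, which the following sketch illustrates with the sets given in the text. The function name is hypothetical, and returning the unchanged candidate set when the intersection is empty is our assumption (the noun then stays ambiguous and is left to user confirmation).

```python
# Valency-based restriction for 'get', copied from Table 4.
VALENCY = {"get": {"T1", "T3", "T3C4", "T3C3"}}


def disambiguate(candidates: set, restriction: set) -> set:
    """Keep only the candidates allowed by the selection restriction.

    Assumption: if nothing survives, the original candidates are returned
    so that the ambiguity can be resolved later by the user.
    """
    narrowed = candidates & restriction
    return narrowed if narrowed else candidates


# Value term ambiguity of '1999', governed by 'get'.
print(disambiguate({"T1C3", "T3C3"}, VALENCY["get"]))

# Class term ambiguity of 'name', using collocation documents retrieved by 'student'.
print(disambiguate({"T1C2", "T2C2"}, {"T1C1", "T1C2", "T1C3"}))
```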



6.3 Database Query Generation 

In this stage, if any ambiguous question noun still has two or more domain
classes, a user confirmation feedback procedure is fired in order to ask the user to
indicate the correct domain class. Otherwise, the following query generation procedure
begins. For query generation, we adopt Zhang’s method [15].

Fig. 3. Finding a query graph from a physical database schema



Disambiguated domain classes constitute a sub-graph on the physical database 
schema, which is viewed as a graph where a node corresponds to a table or a column, 
and an arc is defined by a relationship between two tables, or a property between a 
table and its column. Fig. 3 shows the sub-graph that we call a query graph. In our 
case, the problem is to locate a sub-graph with a minimal number of nodes. 

From the query graph and question features in Table 7, a formal database query is 
generated. The entity nodes of a query graph are transformed to an SQL-from clause, 
and arcs between entities constitute SQL join operators in an SQL-where clause. An 
SQL-select clause is obtained from question focus features, and value operator fea- 
tures are applied to the corresponding attribute nodes to produce an SQL-where 
clause. Note that value operators are fired only on the question nouns that retrieved 
value documents. Thus, the final SQL query is as follows. 



    SELECT T1C2
    FROM   T1, T2, T3
    WHERE  T1.T1C1 = T3.T3C1
      AND  T2.T2C1 = T3.T3C2
      AND  T2C2 = 'Statistics'
      AND  T3C4 = 'A'
      AND  T3C3 >= 1999
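
The assembly of the final query from the query graph and the question features can be sketched as follows. This is not Zhang's method [15]; it only shows how the pieces named above fit together, with a hypothetical function name. Following Table 7, the grade 'A' is matched against T3C4 and the year 1999 against T3C3. A production system would use parameterized queries rather than string formatting.

```python
def build_sql(select_columns, tables, joins, conditions):
    """Assemble an SQL string from question foci, entity nodes, arcs between
    entities, and (column, operator, value) triples from value documents."""
    where = [f"{left} = {right}" for left, right in joins]
    for column, operator, value in conditions:
        if isinstance(value, (int, float)):
            literal = str(value)
        else:
            literal = "'" + str(value).replace("'", "''") + "'"  # escape quotes
        where.append(f"{column} {operator} {literal}")
    sql = f"SELECT {', '.join(select_columns)} FROM {', '.join(tables)}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql


query = build_sql(
    select_columns=["T1C2"],                       # question focus 'name'
    tables=["T1", "T2", "T3"],                     # entity nodes of the query graph
    joins=[("T1.T1C1", "T3.T3C1"), ("T2.T2C1", "T3.T3C2")],
    conditions=[("T2C2", "=", "Statistics"), ("T3C4", "=", "A"), ("T3C3", ">=", 1999)],
)
print(query)
```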



7 Discussions and Conclusion 

To alleviate the transportability problem of NLDBI translation knowledge, this paper
presented an overview of a lightweight NLDBI approach based on an IR framework,
where most translation knowledge is simplified into the form of documents such as
class, value, and collocation documents. Thus, translation knowledge construction is
reduced to document indexing, and noun translation can be carried out based on
document retrieval. In addition, translation knowledge can be easily expanded from
other resources. Another motivation is to create a robust NLDBI system. For
robustness, we used a syntactic analyzer only to obtain question nouns and their
predicate-argument pairs, rather than entire parse trees.

Meng et al. [9] proposed a similar IR approach. However, our approach mainly 
differs from theirs in terms of the following points. First, we focused on automated 
construction and its scalable expansion of translation knowledge, in order to increase 
transportability. Next, our disambiguation strategy in noun translation relies on lin- 
guistically motivated selection restrictions that are extracted from predicate-argument 
pairs of domain predicates. However, Meng et al. [9] used neighboring words as 
disambiguation constraints, because their method does not perform any syntactic 
analysis. 

Currently, the lightweight approach has been applied to only one database domain [7].
In the future, much empirical evaluation is required to validate its operability. For
example, the input guidelines in Section 5.1 should be tested on both different user
groups and several database domains. In addition, we assumed that question meaning is
approximated by a query graph (and its structural constraints) on a physical database
schema, which is determined using question nouns and their predicate-argument
structures. So, the lightweight method may implicitly restrict the expressiveness of
user questions. We plan to investigate how many and what kinds of user questions are
well suited to the lightweight approach.


Abstract. The semantic web vision is one in which rich, ontology-based semantic
markup is widely available, both to enable sophisticated interoperability
among agents and to support human web users in locating and making sense of 
information. The availability of semantic markup on the web also opens the 
way to novel, sophisticated forms of question answering. AquaLog is a portable 
question-answering system which takes queries expressed in natural language 
and an ontology as input and returns answers drawn from one or more knowl- 
edge bases (KBs), which instantiate the input ontology with domain-specific in- 
formation. AquaLog makes use of the GATE NLP platform, string metrics al- 
gorithms, WordNet and a novel ontology-based relation similarity service to 
make sense of user queries with respect to the target knowledge base. Finally, 
although AquaLog has primarily been designed for use with semantic web lan- 
guages, it makes use of a generic plug-in mechanism, which means it can be 
easily interfaced to different ontology servers and knowledge representation 
platforms. 



1 Introduction 

The semantic web vision [1] is one in which rich, ontology-based semantic markup is
widely available, both to enable sophisticated interoperability among agents, e.g., in 
the e-commerce area, and to support human web users in locating and making sense 
of information. For instance, tools such as Magpie [2] support sense-making and 
semantic web browsing by allowing users to select a particular ontology and use it as 
a kind of ‘semantic lens’, which assists them in making sense of the information they 
are looking at. As discussed by McGuinness in her recent essay on “Question An- 
swering on the Semantic Web” [3], the availability of semantic markup on the web 
also opens the way to novel, sophisticated forms of question answering, which not 
only can potentially provide increased precision and recall compared to today’s 
search engines, but are also capable of offering additional functionalities, such as i) 
proactively offering additional information about an answer, ii) providing measures 
of reliability and trust and/or iii) explaining how the answer was derived. 

While semantic information can be used in several different ways to improve 
question answering, an important (and fairly obvious) consequence of the availability 
of semantic markup on the web is that this can indeed be queried directly. For
instance, we are currently augmenting our departmental web site, http://kmi.open.ac.uk,
with semantic markup, by instantiating an ontology describing academic life [4] with
information about our personnel, projects, technologies, events, etc., which is auto- 
matically extracted from departmental databases and unstructured web pages. In the 
context of standard, keyword-based search this semantic markup makes it possible to 
ensure that standard search queries, such as “peter scott home page kmi”, actually 
return Dr Peter Scott’s home page as their first result, rather than some other resource 
(as indeed is the case when using current non-semantic search engines on this par- 
ticular query). Moreover, as pointed out above, we can also query this semantic 
markup directly. For instance, we can ask a query such as “list all the kmi projects in 
the semantic web area” and, thanks to an inference engine able to reason about the 
semantic markup and draw inferences from axioms in the ontology, we can then get 
the correct answer. 

This scenario is of course very similar to asking natural language queries to
databases (NLDB), which has long been an area of research in the artificial
intelligence and database communities [5] [6] [7] [8] [9], even if in the past decade
it has somewhat gone out of fashion [10] [11]. However, it is our view that the
semantic web provides a new and potentially very important context in which results
from this area of research can be applied. Moreover, interestingly from a research
point of view, it provides a new ‘twist’ on the old issues associated with NLDB
research. Hence, in the first instance, the work on the AquaLog query answering system
described in this paper is based on the premise that the semantic web will benefit
from the availability of natural language query interfaces, which allow users to query
semantic markup viewed as a knowledge base. Moreover, similarly to the approach we
have adopted in the Magpie system, we believe that in the semantic web scenario it
makes sense to provide query answering systems which are portable with respect to
ontologies. In other words, just like in the case of Magpie, where the user is
able to select an ontology (essentially a semantic viewpoint) and then browse the web
through this semantic filter, our AquaLog system allows the user to choose an
ontology and then ask queries with respect to the universe of discourse covered by the
ontology.



2 The AquaLog Approach 

AquaLog is a portable question-answering system which takes queries expressed in 
natural language and an ontology as input and returns answers drawn from one or 
more knowledge bases (KBs), which instantiate the input ontology with domain- 
specific information. As already emphasized, a key feature of AquaLog is that it is 
modular with respect to the input ontology, the aim here being that it should be zero 
cost to switch from one ontology to another when using AquaLog. 

AquaLog is part of the AQUA [12] framework for question answering on the semantic
web and in particular addresses the upstream part of the AQUA process, the
translation of NL queries into logical ones and the interpretation of these NL-derived
logical queries with respect to a given ontology and available semantic markup.

AquaLog adopts a triple-based data model and, as shown in figure 1, the role of
the linguistic component of AquaLog is to translate the input query into a set of
intermediate triples, of the form <subject, predicate, object>. These are then further
processed by a module called the “Relation Similarity Service” (RSS), to produce
ontology-compliant queries. For example, in the context of the academic domain
mentioned earlier, AquaLog is able to translate the question “Who is a Professor at
the Knowledge Media Institute?” into the following, ontology-compliant logical
query, <typeOf ?x Professor-in-Academia> & <works-in-unit ?x KMi>, expressed as
a conjunction of non-ground triples (i.e., triples containing variables). The role of
the RSS is to map the intermediate form, <?who, Professor, KMi>, into the target,
ontology-compliant query.
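
To make the notion of a conjunction of non-ground triples concrete, the sketch below evaluates the example query against a tiny in-memory knowledge base. The facts, the individual names, and the function names are invented for illustration; AquaLog itself delegates this step to an ontology server, so this is not its implementation.

```python
# Toy KB of ground triples in the <relation subject object> order of the query.
KB = [
    ("typeOf", "alice", "Professor-in-Academia"),
    ("works-in-unit", "alice", "KMi"),
    ("typeOf", "bob", "Professor-in-Academia"),
    ("works-in-unit", "bob", "Other-Unit"),
    ("typeOf", "carol", "Researcher"),
    ("works-in-unit", "carol", "KMi"),
]


def match(pattern, fact, binding):
    """Extend binding so that pattern equals fact, or return None."""
    extended = dict(binding)
    for p, f in zip(pattern, fact):
        if p.startswith("?"):              # variable: bind or check consistency
            if extended.setdefault(p, f) != f:
                return None
        elif p != f:                        # constant: must match exactly
            return None
    return extended


def answer(query, kb):
    """Return all variable bindings that satisfy every triple in the query."""
    bindings = [{}]
    for pattern in query:
        bindings = [
            extended
            for binding in bindings
            for fact in kb
            if (extended := match(pattern, fact, binding)) is not None
        ]
    return bindings


QUERY = [("typeOf", "?x", "Professor-in-Academia"), ("works-in-unit", "?x", "KMi")]
print(answer(QUERY, KB))
```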

There are two main reasons for adopting a triple-based data model. First of all, as
Katz et al. point out [13], although not all possible queries can be represented in the
binary relational model, in practice those that can occur very frequently. Secondly,
RDF-based knowledge representation (KR) formalisms for the semantic web, such as RDF
itself [14] or OWL [15], also subscribe to this binary relational model and express
statements as <subject, predicate, object>. Hence, it makes sense for a query system
targeted at the semantic web to adopt this data model.

Fig. 1. The Architecture of AquaLog



We have seen that, in common with most other NLDB systems, AquaLog divides
the task of mapping user queries to answers into two main subtasks: producing an
intermediate logical representation from the input query and mapping this
intermediate query into a form consistent with the target knowledge base. Moreover, it
explicitly restricts the range of questions the user is allowed to ask to a set of
expressions/syntactic patterns, so that the linguistic limits of the system are
obvious to the user (to avoid the effort of rephrasing questions) and to ensure users
understand whether a query to AquaLog failed for reasons which are linguistic (failure
to understand the linguistic structure of the question) or conceptual (failure of the
ontology to provide meaning to the concepts in the query).

In the next section we describe the AquaLog architecture in more detail. 



3 The AquaLog Architecture 

AquaLog was designed with the aim of making the system as flexible and as modular 
as possible. It is implemented in Java as a web application, using a client-server ar- 
chitecture. A key feature is the use of a plug-in mechanism, which allows AquaLog 
to be configured for different KR languages. Currently we use it with our own 
OCML-based KR infrastructure [16] [17], although in the future we plan to provide
direct plug-in mechanisms for use with the emerging RDF and OWL servers¹.



3.1 Initialization and User’s Session 

At initialization time the AquaLog server will access and cache basic indexing data 
for the target KB(s), so that they can be efficiently accessed by the remote clients to 
guarantee real-time question answering, even when multiple users access the server 
simultaneously. 



3.2 GATE Framework for Natural Language and Query Classify Service

AquaLog uses the GATE [18] [19] infrastructure and resources (language resources,
processing resources like ANNIE, serial controllers, etc.) to map the input query in
natural language to the triple-based data model. Communication between AquaLog
and GATE takes place through the standard GATE API. The GATE chunker used for
this task does not actually generate a parse tree. As discussed by Katz et al. [20]
[21], although parse trees (such as those produced by the Stanford NLP parser [22])
capture syntactic relations, they are often time-consuming to build and difficult to
manipulate. Moreover, we also found that in our context we could do away with much of
the parsed information. For the intermediate representation, we use the triple-based
data model rather than logic, because at this stage we do not have to worry about
getting the representation right. The role of the intermediate representation is
simply to provide an easy-to-manipulate input for the RSS.



¹ In any case we are able to import and export RDF(S) and OWL from OCML, so the lack of
an explicit RDF/OWL plug-in is actually not a problem in practice.

After the execution of the GATE controller, a set of syntactical annotations
associated with the input query is returned. Annotations include information about
sentences, tokens, nouns and verbs. For example, we get voice and tense for the verbs,
or categories for the nouns, such as determinant, singular/plural, conjunction,
possessive, determiner, preposition, existential, wh-determiner, etc. When developing
AquaLog we extended the set of annotations returned by GATE, by identifying
relations and question indicators (which/who/when/etc.). This was achieved through the
use of JAPE grammars. These consist of a set of phases, which run sequentially, each
of which is defined as a set of pattern rules. The reason for this extension was to be
able to clearly identify the scope of a relation - e.g., to be able to identify “has
research interest” as a relation. Here we exploited the fact that natural language
commonly employs a preposition to express a relationship.

Although it is not necessary, we could improve the precision of the AquaLog
linguistic component by modifying or creating the appropriate JAPE rules for specific
cases. For example, the word “project” could be understood as a noun or as a verb,
depending on the priority of the rules. Another example is when some disambiguation
is necessary, as in the example “Has john research interest in ontologies?”. Here
“research” could be either the last name of John or a noun that is part of the
relation “has research interest”².

As shown in figure 1, the architecture also includes a post-processing semantic 
(PPS) module to further process the annotations obtained from the extended GATE 
component. For example when processing the query “John has research interest in 
ontologies”, the PPS ensures that the relation is identified as “has research interest”. 
Other more complex cases are also dealt with. 

Finally, before passing the intermediate triples to the RSS, AquaLog performs two 
additional checks. If it is not possible to transform the question into a term-relation 
form or the question is not recognized, further explanation is given to the user, to help 
him to reformulate the question. If the question is valid, then a Query Classify Service 
is invoked to determine the type of the question, e.g., a “where” question, and pass 
this information on to the Relation Similarity Service. 



3.3 The Relation Similarity Service 

This is the backbone of the question-answering system. The RSS is called after the 
NL query has been transformed into a term-relation form and classified and it is the 
main component responsible for producing an ontology-compliant logical query. 

Essentially the RSS tries to make sense of the input query by looking at the
structure of the ontology and the information stored in the target KBs, as well as
using string similarity matching and lexical resources, such as WordNet. There is not a
single strategy here, so we will not attempt to give a comprehensive overview of all
the possible ways in which a query can be interpreted. Rather, we will show one
example, which is illustrative of the way the RSS works.

² Of course a better way to express the query would be “Has John got a research
interest in ontologies?”, which can be parsed with no problems. However, in our
experience these slightly ungrammatical queries are very common and it is our aim to
produce a system robust enough to deal with many of them.

An important aspect of the RSS is that it is interactive. In other words when unsure 
it will ask the user for help, e.g., when it is not able to disambiguate between two 
possible relations which can be used to interpret the query. 

For example, let’s consider a simple question like “who works on the semantic
web?”. Here, the first step for the RSS is to identify that “semantic web” is actually
a “research area” in the target KB³. If a successful match is found, the problem
becomes one of finding a relation which links a person (or an organization) to the
semantic web area.







Fig. 2. AquaLog in action. 

By analyzing the KB, AquaLog finds that the only relation between a person and 
the semantic web area is has-research-interest and therefore suggests to the user that 
the question could be interpreted in terms of this relation. If the user clicks OK, then 
the answer to the query is provided. It is important to note that in order to make sense 
of the triple <person, works, semantic web>, all subclasses of person need to be con- 
sidered, given that the relation has-research-interest could be defined only for re- 
searchers rather than people in general. If multiple relations are possible candidates 
for interpreting the query, then string matching is used to determine the most likely 
candidate, using the relation name, eventual aliases, or synonyms provided by lexical 
resources such as WordNet [23]. If no relations are found using this method, then the 



3 Naming conventions vary depending on the KR used to represent the KB and may even
change from one ontology to another - e.g., an ontology can have slots such as “variant name” or “pretty
name”. AquaLog deals with differences between KRs by means of the plug-in mechanism.
Differences between ontologies need to be handled by specifying this information in a
configuration file.
user is asked to choose from the current list of candidates. However, it is important to
emphasise that users are called on to disambiguate only if AquaLog has no information
that would allow it to disambiguate the query directly.
For instance, let’s consider the two queries shown in figure 3. On the right screen we
are looking for the web address of Peter and, given that the system is unable to
disambiguate between Peter-Scott, Peter-Sharpe or Peter-Whalley, the user’s feedback is
required. However, on the left screen we are asking for the web address of Peter, who
has an interest in knowledge reuse. In this case AquaLog does not need assistance
from the user, given that only one of the three Peters has an interest in knowledge
reuse.
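
The interactive strategy described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy relations, the subclass table, the function names and the 0.5 cutoff are all hypothetical, and Python's difflib stands in for the string metrics that AquaLog actually uses.

```python
from difflib import SequenceMatcher

# Hypothetical toy KB: (domain class, relation name, range class).
RELATIONS = [
    ("researcher", "has-research-interest", "research-area"),
    ("person", "has-web-address", "url"),
    ("project", "has-topic", "research-area"),
]
# Hypothetical subclass table: class -> set of its subclasses.
SUBCLASSES = {"person": {"researcher", "student"}}


def candidate_relations(relations, subclasses, subject_class, object_class):
    """Relations linking the subject class (or any of its subclasses) to the object class."""
    domains = subclasses.get(subject_class, set()) | {subject_class}
    return [name for domain, name, rng in relations
            if domain in domains and rng == object_class]


def resolve_relation(query_term, candidates, cutoff=0.5):
    """Return (status, relations) for the relation term found in the query."""
    if not candidates:
        return "no_relation", []
    if len(candidates) == 1:
        # Only one relation is possible: suggest it and let the user confirm.
        return "suggest", candidates
    # Several candidates: rank them by string similarity to the query term.
    scored = [(SequenceMatcher(None, query_term, c).ratio(), c) for c in candidates]
    best_score, best = max(scored)
    if best_score >= cutoff:
        return "matched", [best]
    # String matching was inconclusive: fall back to asking the user.
    return "ask_user", candidates


candidates = candidate_relations(RELATIONS, SUBCLASSES, "person", "research-area")
print(resolve_relation("works", candidates))
```

Considering the subclasses of person is what lets the sketch find has-research-interest even though, in this toy KB, the relation is defined only for researchers.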




Fig. 3. Automatic or user-driven disambiguation. 



A typical situation the RSS has to cope with is one in which the structure of the
intermediate query does not match the way the information is represented in the
ontology. For instance, the query “which are the publications of John?” may be parsed
into <John, publications, something>, while the ontology may be organized in terms
of <Publication, has-author, Author>. Also in these cases the RSS is able to reason
about the mismatch and generate the correct logical query.



3.4 Helping the User Make Sense of the Information

We have already mentioned that AquaLog is an interactive system. If the RSS fails to 
make sense of a query, the user is asked to help choose the right relation or instance. 
Moreover, in order to help the user, AquaLog shows as much information as possible 
about the current query, including both information drawn from WordNet and from 
the ontology - see figure 4. 
Fig. 4. Providing additional information

Finally, when answers are presented to the user, it is also possible for him/her to 
use these answers as starting points for navigating the KB - see figure 5. 




Fig. 5. Navigating the KB starting from answers to queries. 
3.5 String Matching Algorithms

String algorithms are used to find patterns in the ontology for any of the terms inside
the intermediate triples obtained from the user’s query. They are based on String
Distance Metrics for Name-Matching Tasks, using an open-source package from Carnegie
Mellon University in Pittsburgh [24]. This comprises a number of string distance
metrics proposed by different communities, including edit-distance metrics, fast
heuristic string comparators, token-based distance metrics, and hybrid methods. The
conclusions of an experimental comparison of metrics carried out by the University of
Pittsburgh state: "Overall, the best-performing method is a hybrid scheme combining
a TFIDF weighting scheme ... with the Jaro-Winkler string distance scheme,
developed in the probabilistic record linkage community".

However, it is not advisable to rely on just one metric. Experimental
comparisons using the AKT ontology with different metrics show that the best
performance is obtained with a combination of the following metrics: JaroWinkler,
Level2JaroWinkler and Jaro.

A key aspect of using metrics is thresholds. By default two kinds of thresholds are
used, called “trustable” and “did you mean?”, where the former is of course always
preferable to the latter. AquaLog uses different thresholds depending on whether it is
looking for concept names, relations or instances.
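
As a rough illustration of how a metric score could be checked against the two kinds of thresholds, the sketch below implements the textbook Jaro and Jaro-Winkler similarities. It is not the package cited as [24]; the threshold values (0.90 and 0.75) and the function names are assumptions chosen for the example, since the text does not report the values AquaLog uses.

```python
def jaro(s, t):
    """Textbook Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_flags = [False] * len(s)
    t_flags = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_flags[j] and t[j] == ch:
                s_flags[i] = t_flags[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_matched = [c for c, f in zip(s, s_flags) if f]
    t_matched = [c for c, f in zip(t, t_flags) if f]
    transpositions = sum(a != b for a, b in zip(s_matched, t_matched)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, scaling=0.1):
    """Jaro score boosted for a shared prefix of up to four characters."""
    score = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * scaling * (1 - score)


# Assumed threshold values, for illustration only.
TRUSTABLE = 0.90
DID_YOU_MEAN = 0.75


def classify_match(term, candidate):
    score = jaro_winkler(term, candidate)
    if score >= TRUSTABLE:
        return "trustable"
    if score >= DID_YOU_MEAN:
        return "did you mean?"
    return "no match"


print(classify_match("MARTHA", "MARHTA"))
```

In a fuller version the two thresholds would be looked up per kind of term (concept name, relation or instance), as the text describes.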



4 Evaluation 

4.1 Scenario and Results 

A full evaluation of AquaLog will require both an evaluation of its query answering
ability as well as an evaluation of the overall user experience. Moreover, because one of
our key aims is to make AquaLog portable across ontologies, this aspect will also
have to be evaluated formally. While a full evaluation has not been carried out yet,
we performed an initial study, whose aim was to assess to what extent the
application built using AquaLog with the AKT ontology and the KMi knowledge
base satisfied user expectations about the range of questions the system should be
able to answer. A second aim of the experiment was to provide information
about the nature of the possible extensions needed to the ontology and the linguistic
components - i.e., we wanted not only to assess the current coverage of the system
but also to get some data about the complexity of the possible changes required to
generate the next version of the system. Thus, we asked ten members of KMi, none of
whom had been involved in the AquaLog project, to generate questions for the
system. Because one of the aims of the experiment was to measure the linguistic
coverage of the system with respect to user needs, we did not provide them with any
information about the linguistic ability of the system. However, we did tell them
something about the conceptual coverage of the ontology, pointing out that its aim
was to model the key elements of a research lab, such as publications, technologies,
projects, research areas, people, etc.
We also pointed out that the current KB is limited in its handling of temporal
information, and therefore asked them not to ask questions which required sophisticated
temporal reasoning. Because no ‘quality control’ was carried out on the questions, it
was perfectly OK for these to contain spelling mistakes and even grammatical errors.

We collected in total 76 different questions, 37 of which were handled correctly by
AquaLog, i.e., 48.68% of the total. This was a pretty good result, considering that no
linguistic restriction was imposed on the questions.

We analyzed the failures and divided them into the following five categories (the
total adds up to more than 39 because a query may fail at several different levels):

• Linguistic failure. This occurs when the NLP component is unable to generate 
the intermediate representation (but the question can usually be reformulated 
and answered). This was by far the most common problem, occurring in 27 of 
the 39 queries not handled by AquaLog (69%). 

• Data model failure. This occurs when the NL query is simply too complicated 
for the intermediate representation. Intriguingly this type of failure never oc- 
curred, and our intuition is that this was the case not only because the relational 
model is actually a pretty good way to model queries but also because the ontol- 
ogy-driven nature of the exercise ensured that people only formulated queries 
that could in theory (if not in practice) be answered by reasoning about the de- 
partmental KB. 

• RSS failure. This occurs when the relation similarity service of AquaLog is un- 
able to map an intermediate representation to the correct ontology-compliant 
logical expression. Only 3 of the 39 queries not handled by AquaLog (7.6%) 
fall into this category. 

• Conceptual failure. This occurs when the ontology does not cover the query. 
Only 4 of the 39 queries not handled by AquaLog (10.2%) failed for this reason. 

• Service failure. Several queries essentially asked for services to be defined over 
the ontologies. For instance one query asked about “the top researchers”, which 
requires a mechanism for ranking researchers in the lab - people could be ranked 
according to citation impact, formal status in the department, etc. In the context 
of the semantic web, we believe that these failures are less to do with shortcom- 
ings of the ontology than with the lack of appropriate services, defined over the 
ontology. Therefore we defined this additional category which accounted for 8 
of the 39 failures (20.5%). 
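
The figures reported above can be cross-checked with a few lines of arithmetic. The counts are taken from the text; the variable names are only illustrative.

```python
TOTAL_QUESTIONS = 76
HANDLED = 37

# Failure counts per category, as reported in the list above.
FAILURES = {
    "linguistic": 27,
    "data model": 0,
    "RSS": 3,
    "conceptual": 4,
    "service": 8,
}


def pct(part, whole):
    """Percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


failed = TOTAL_QUESTIONS - HANDLED                # queries not handled
success_rate = pct(HANDLED, TOTAL_QUESTIONS)      # share of queries handled
shares = {name: pct(count, failed) for name, count in FAILURES.items()}
# Category counts exceed the number of failed queries because a single
# query may fail at several levels.
overlap = sum(FAILURES.values()) - failed

print(failed, success_rate, shares, overlap)
```

The category counts sum to 42 while only 39 queries failed, which is consistent with the remark that a query may be counted in more than one category.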



4.2 Discussion 

Here we briefly discuss the issues raised by the evaluation study and in particular 
what can be done to improve the performance of the AquaLog system. 

Service failures can of course be solved by implementing the appropriate services. 
Some of these can actually be to some extent ontology-independent, such as “similar- 
ity services”, which can answer questions like “Is there a project similar to AKT?”. 
Other services can be generically categorized (e.g., “ranking services”), but will have 
to be defined for specific concepts in an ontology, such as mechanisms to rank people,
publications, or projects. Here we envisage a solution similar to the one used in the
Magpie tool [2], where service developers are given publishing rights to develop and
associate services to concepts in an ontology, using semantic web service platforms
such as IRS-II [25].

The few RSS failures basically highlighted bugs in AquaLog, all of which can be
fixed quite easily. A clear example of this is the query “who funds the magpie
project”, where “who” is understood to be a person, while it of course can also be an
organization or funding body.

The few conceptual failures are also easy to fix: they highlighted omissions in the
ontology.

The most common and problematic errors are linguistic ones, which occurred for a 
number of reasons: 

• In some cases people asked new types of basic queries outside the current
linguistic coverage, formed with: “how long”, “there are/is”, “are/is there any”, “how
many”, etc.

• Some problems were due to combinations of basic queries, such as “What are
the publications in KMi related to social aspects in collaboration and learning?”,
which the NLP component of AquaLog cannot currently untangle correctly.
Another example is when one of the terms is “hidden” because it appears to be
part of the relation but actually is not, as for example in “Who are the
partners involved in the AKT project?” One may think the relation is “partners
involved in” between persons and projects; however, the relation is “involved in”
between “partners” and “projects”.

Sometimes queries fail because of a combination of different reasons. For instance,
“which main areas are corporately funded?” falls within the category of service
failure (it requires a ranking service), but it is also a linguistic and conceptual
failure (the latter because the ontology lacks a funding relationship between
research-areas and corporations).

In sum, our estimate is that implementing the ranking services and
extending the NLP component to cover the basic queries not yet handled linguistically
will address 14 of the 39 failures (35.89%). An implementation of new mechanisms
to handle combinations of basic queries will address another 12 failures (30.7%).
Hence, removing redundancies and including also the fixes to the ontology, with both
implementations it will be possible to handle 34 of the 39 failures, thus potentially
achieving an 87% hit rate for AquaLog. Although a more detailed analysis of these
problems may be needed, at this stage it does not seem particularly problematic to
add these functionalities to AquaLog.



5 Related Work 

We already pointed out that research in natural language interfaces to databases is 
currently a bit ‘dormant’ (although see [26] for recent work in the area), therefore it is 
not surprising that most current work on question answering is somewhat different in 
nature from AquaLog. Natural language search engines such as AskJeeves [27] and 
EasyAsk [28] exist, as well as systems such as START [13] and REXTOR [21],
whose goal is to extract answers from text. AnswerBus [29] is another open-domain 
question-answering system based on web information retrieval. FAQ Finder [30] is a 
natural language question-answering system that uses files of FAQs as its knowledge 
base; it also uses WordNet to improve its ability to match questions to answers, using 
two metrics: statistical similarity and semantic similarity. 

PRECISE [26] maps questions to the corresponding SQL query, by identifying 
classes of questions that are easy to understand in a well defined sense: the paper 
defines a formal notion of semantically tractable questions. Questions are sets of 
attribute/value pairs and a relation token corresponds to either an attribute token or a 
value token. Apart from the differences in terminology this is actually similar to the 
AquaLog model. In PRECISE, as in AquaLog, a lexicon is used to find synonyms.
In PRECISE the problem of finding a mapping from the tokenization to the database
is reduced to a graph matching problem. The main difference with AquaLog is that in
PRECISE all tokens must be distinct; questions with unknown words are not
semantically tractable and cannot be handled. In contrast with PRECISE, although AquaLog
also uses pattern matching to identify at least one of the terms of the relation, it is still 
able in many cases to interpret the query, even if the words in the relation are not 
recognized (i.e., there is no match to any concept or relation in the ontology). The 
reason for this is that AquaLog is able to reason about the structure of the ontology to 
make sense of relations which appear to have no match to the KB. Using the example 
suggested in [26], AquaLog would not necessarily know the term “neighborhood”, 
but it might know that it must look for the value of a relation defined for cities. In 
many cases this information is all AquaLog needs to interpret the query. 

MASQUE/SQL [7] is a portable natural language front-end to SQL databases. The 
semi-automatic configuration procedure uses a built-in domain editor which helps the 
user to describe the entity types to which the database refers, using an is-a hierarchy, 
and then declare the words expected to appear in the NL questions and define their 
meaning in terms of a logic predicate that is linked to a database table/view. In
contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities with no
need for an intensive configuration procedure.



6 Conclusion 

In this paper we have described the AquaLog query answering system, emphasizing 
its genesis in the context of semantic web research. Although only initial evaluation 
results are available, the approach used by AquaLog, which relies on an RSS
component able to use information about the current ontology, string matching and
similarity measures to interpret the intermediate queries generated by the NLP component,
appears very promising. Moreover, in contrast with other systems AquaLog requires 
very little configuration effort. For the future we plan to make the AquaLog linguistic 
component more robust, primarily on the basis of the feedback received from the 
evaluation study carried out on the KMi domain. In addition we also intend to carry 
out a formal analysis of the RSS component to provide a more accurate and formal 
account of its competence. As already mentioned, more comprehensive evaluation
studies will also be needed. Finally, although the ontology-driven approach provides
one of the main strengths of AquaLog, we have also started to investigate the
possibility of accessing more than one ontology simultaneously in a transparent way for the
user [31].
Abstract. This paper addresses the problem of mapping Natural Language to
SQL queries. It assumes that the input is in English and details a
methodology to build a SQL query based on the input sentence, a dictionary 
and a set of production rules. The dictionary consists of semantic sets and index 
files. A semantic set is created for each table or attribute name and contains 
synonyms, hyponyms and hypernyms as retrieved by WordNet and comple- 
mented manually. The index files contain pointers to records in the database, 
ordered by value and by data type. The dictionary and the production rules 
form a context-free grammar for producing the SQL queries. The context am- 
biguities are addressed through the use of the derivationally related forms based 
on WordNet. Building the run time semantic sets of the input tokens helps
solve the ambiguities related to the database schema. The proposed method
introduces two functional entities: a pre-processor and a runtime engine. The 
pre-processor reads the database schema and uses WordNet to create the se- 
mantic sets and the set of production rules. It also reads the database records 
and creates the index files. The run time engine matches the input tokens to the 
dictionary and uses the rules to create the corresponding SQL query. 



1 Introduction 

In early research work Minker [3] identified three basic steps in developing a natural 
language processing system. First, a model for the input language must be selected. 
Second, the input language must be analyzed and converted into an internal
representation based on the syntactic analysis. Third, the internal representation is
translated into the target language by using the semantic analysis. The syntactic and
semantic analyses can use learning algorithms and statistical methods [5] as shown in Fig. 1.
The methods involved in the semantic analysis, either deterministic or statistical, are
heavily dependent on the context. There is always a degree of uncertainty related to 
the context and to the ambiguities of the natural language that translates into errors in 
the target language. Stuart, Chen and Shyu showed that the influence of the context 
on meaning grows exponentially with the length of a word sequence and it can be 
addressed through the so-called rule-based randomization [4]. Another way to mini- 
mize the uncertainty during the semantic analysis is to use highly organized data. 

F. Meziane and E. Metais (Eds.): NLDB 2004, LNCS 3136, pp. 103-113, 2004. 
Fig. 1. Natural Language Processing through learning algorithms and statistical methods



The present paper describes a method for the semantic analysis of the natural
language query and its translation into SQL for a relational database. It is a continuation of
the work presented at the NLDB’2003 conference in Cottbus, Germany [2].



2 Natural Language Processing and RDBMS 1 

The highly organized data in RDBMS can be used to improve the quality of the se- 
mantic analysis. Previous work [2] introduced a template-based semantic analysis 
approach as shown on the left hand side of Fig. 2. The system consists of a syntactic 
analyzer based on the LinkParser [10] and a semantic analyzer using templates. The 
pre-processor builds the domain-specific interpretation rules based on the database 
schema, WordNet and a set of production rules. This method has limited capabilities 
in addressing the context ambiguities. The current goal is to improve the context 
disambiguation. 

The method presented in this paper is shown on the right hand side of Fig. 2. The intent 
is to build the semantic analysis through the use of the database properties. Unlike the 
work described in [2], this method does not use syntactic analysis and it does not use 
user supplied rules. The input sentence is tokenized and then it is sent to the semantic 
analyzer. The semantic analysis is based on token matching between the input and a 
dictionary. The dictionary is formed of semantic sets and index files. The semantic sets 
are based on the synonyms, hypernyms and hyponyms related to the table and attribute 
names, as returned by WordNet [1]. The semantic set is manually edited to eliminate the
less relevant elements and to address the case of meaningless names and abbreviations, 
such as Empl for Employers or Stdn for Students. The index files are links to the actual 
records in the database. The production rules are based on the database schema and are 
used to generate the target SQL query as shown in Fig. 3. 



1 RDBMS Relational Database Management System 
Fig. 2. Differences between the template-based and the token matching approaches




Fig. 3. Input language to SQL query 



The internal representation is reduced to a set of tokens extracted from the natural
language query. Using the derivationally related forms based on WordNet solves the 
run time ambiguities as shown in the following paragraphs. 



3 The High Level Architecture 

The proposed method is based on two functional units: a pre-processor and a runtime 
engine. The pre-processor analyzes the database schema and creates the semantic sets, 
the index files and the production rules, as shown in Fig. 4. The run time engine uses 
the semantic sets and the index files to match the input tokens with table and attribute 
names or with values in the database. 
Fig. 4. The high level design



3.1 The Database Schema 

In the example shown in Fig. 4, the database contains four tables as described below:

Table AUTHORS (LONG id PRIMARY_KEY, CHAR name, DATE DOB) 

Table BOOKS (LONG id PRIMARY_KEY, CHAR title, LONG isbn, LONG pages, 
DATE date_published, ENUM type) 

Table BOOKAUTHORS (LONG aid references AUTHORS, LONG bid references 
BOOKS) 

Table STUDENTS (LONG id PRIMARY_KEY , CHAR name, DATE DOB) 



Table AUTHORS (id, name, DOB) 

1, 'Mark Twain', NOV-20-1835 

2, 'William Shakespeare', APR-23-1564 

3, 'Emma Lazarus', MAR-3-1849

Table BOOKS (id, title, isbn, pages, date_published, type) 

1, 'Poems', 12223335, 500, FEB-14, SEP-1-1865, 'POEM' 

2, 'Tom Sawyer', 12223333, 700, NOV-1-1870, 'ROMAN' 

3, 'Romeo and Juliet', 12223334, 600, AUG-5-1595, 'TRAGEDY' 
Table BOOKAUTHORS (aid, bid)

1, 2

2, 3 

3, 1 

Table STUDENTS (id, name, DOB) 

1, 'John Markus', FEB-2-1985 

2, 'Mark Twain', APR-2-1980 

3, 'Eli Moss', SEP-10-1982 
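
For readers who wish to experiment with the example, the schema and sample rows above can be loaded into an in-memory SQLite database. This is a sketch only: the type mapping (LONG to INTEGER; CHAR, DATE and ENUM to TEXT) is an assumption, dates are stored as printed, and the 'Poems' row keeps only SEP-1-1865 as its publication date because the source row lists one value too many.

```python
import sqlite3

SCHEMA = """
CREATE TABLE AUTHORS (id INTEGER PRIMARY KEY, name TEXT, DOB TEXT);
CREATE TABLE BOOKS (id INTEGER PRIMARY KEY, title TEXT, isbn INTEGER,
                    pages INTEGER, date_published TEXT, type TEXT);
CREATE TABLE BOOKAUTHORS (aid INTEGER REFERENCES AUTHORS(id),
                          bid INTEGER REFERENCES BOOKS(id));
CREATE TABLE STUDENTS (id INTEGER PRIMARY KEY, name TEXT, DOB TEXT);
"""

AUTHORS = [(1, "Mark Twain", "NOV-20-1835"),
           (2, "William Shakespeare", "APR-23-1564"),
           (3, "Emma Lazarus", "MAR-3-1849")]
BOOKS = [(1, "Poems", 12223335, 500, "SEP-1-1865", "POEM"),  # date is an assumption
         (2, "Tom Sawyer", 12223333, 700, "NOV-1-1870", "ROMAN"),
         (3, "Romeo and Juliet", 12223334, 600, "AUG-5-1595", "TRAGEDY")]
BOOKAUTHORS = [(1, 2), (2, 3), (3, 1)]
STUDENTS = [(1, "John Markus", "FEB-2-1985"),
            (2, "Mark Twain", "APR-2-1980"),
            (3, "Eli Moss", "SEP-10-1982")]


def build_sample_db():
    """Create the four example tables in memory and fill them with the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO AUTHORS VALUES (?, ?, ?)", AUTHORS)
    conn.executemany("INSERT INTO BOOKS VALUES (?, ?, ?, ?, ?, ?)", BOOKS)
    conn.executemany("INSERT INTO BOOKAUTHORS VALUES (?, ?)", BOOKAUTHORS)
    conn.executemany("INSERT INTO STUDENTS VALUES (?, ?, ?)", STUDENTS)
    conn.commit()
    return conn


conn = build_sample_db()
print(conn.execute("SELECT name FROM AUTHORS ORDER BY id").fetchall())
```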



3.2 Using WordNet to Create the Semantic Sets 

The database schema table and attribute names are introduced as input to WordNet 2 to
build the semantic sets for each table and attribute name. The semantic sets include
synonyms, hypernyms and hyponyms of each name in the schema. For table
AUTHORS WordNet returns the following list:

WordNet 2.0 Search
Overview for 'author' 

The noun 'author' has 2 senses in WordNet. 

1. writer, author -- (writes (books or stories or articles or the 
like) professionally (for pay) ) 

2. generator, source, author -- (someone who originates or causes or 
initiates something; 'he was the generator of several complaints') 
Results for 'Synonyms, hypernyms and hyponyms ordered by estimated 
frequency' search of noun 'authors' 

2 senses of author 

Sense 1 writer, author => communicator 

Sense 2 generator, source, author => maker, shaper 

The semantic set for AUTHORS becomes: 

{writer, author, generator, source, maker, communicator, shaper} 

Any token match between the input sentence and the AUTHORS semantic set ele- 
ments will place the AUTHORS in the SQL table list. WordNet returns the following 
semantic set for BOOKS: 

{book, volume, ledger, leger, account book, book of account, record, 
record book, script, playscript, rule book} 

Similarly all the remaining table and attribute names in the database schema are asso- 
ciated with their semantic sets as returned by WordNet. One limitation of the model 
resides in the necessity of having meaningful names for all attributes and tables. If 
this is not the case, the corresponding semantic sets have to be built manually, by 
sending the appropriate queries to WordNet. For example, instead of using the
AUTHORS attribute name ‘DOB’ the operator queries WordNet for ‘Date’ and adds
‘birth date’ to the semantic set associated with AUTHORS.DOB.
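
The construction of a semantic set and its use for placing tables in the table list can be sketched as follows. The senses are copied from the WordNet output quoted above rather than fetched from WordNet; the function names and the naive plural stripping are assumptions made for the example.

```python
# (synonyms, hypernyms) per sense, copied from the WordNet output for 'author'.
AUTHOR_SENSES = [
    (["writer", "author"], ["communicator"]),
    (["generator", "source", "author"], ["maker", "shaper"]),
]

# Semantic set for BOOKS as listed in the text.
BOOKS_SET = {"book", "volume", "ledger", "leger", "account book",
             "book of account", "record", "record book", "script",
             "playscript", "rule book"}


def semantic_set(senses, manual_additions=()):
    """Union of the synonyms and hypernyms of every sense, plus manual entries."""
    words = set(manual_additions)
    for synonyms, hypernyms in senses:
        words.update(synonyms)
        words.update(hypernyms)
    return words


def normalize(token):
    """Lower-case the token and strip a trailing plural 's' (a crude assumption)."""
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def matching_tables(tokens, semantic_sets):
    """Tables whose semantic set contains at least one input token."""
    return {table
            for token in tokens
            for table, words in semantic_sets.items()
            if normalize(token) in words}


SEMANTIC_SETS = {"AUTHORS": semantic_set(AUTHOR_SENSES), "BOOKS": BOOKS_SET}
print(matching_tables("Show all books".split(), SEMANTIC_SETS))
```

The optional manual_additions argument mirrors the manual editing step, for example adding 'birth date' to the set associated with AUTHORS.DOB.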



2 WordNet Reference: http://www.cogsci.princeton.edu/~wn/ [1]
3.3 Indexing the ENUM Types

An index file relating (Table, Attribute) and the ENUM values is created for each
ENUM attribute. Example: TABLE=BOOKS, ATTRIBUTE=TYPE,
ENUM={ROMAN, NOVEL, POEM}. If any of the TYPE values occurs in the input
sentence, a new production rule is added to the SQL relating BOOKS.TYPE to
VALUE, as in the example below:

Sentence: 'List all romans'

ROMAN is a valid value for BOOKS.TYPE
The production rule is:

WHERE BOOKS.TYPE='ROMAN'

The resulting SQL query is:

SELECT BOOKS.* FROM BOOKS WHERE BOOKS.TYPE='ROMAN'
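
A minimal sketch of the ENUM lookup follows, assuming the index is held in a dictionary keyed by (table, attribute). The function names and the plural handling are hypothetical, and values are interpolated into the SQL text only to mirror the rules printed above; real code would use bound parameters.

```python
# Index of ENUM values, following the example in the text.
ENUM_INDEX = {("BOOKS", "TYPE"): {"ROMAN", "NOVEL", "POEM"}}


def enum_rules(sentence, enum_index):
    """Return one 'TABLE.ATTRIBUTE=value' rule per ENUM value found in the sentence."""
    rules = []
    for raw in sentence.split():
        token = raw.strip(".,?!'\"").upper()
        forms = {token}
        if token.endswith("S"):
            forms.add(token[:-1])  # crude plural handling: 'ROMANS' -> 'ROMAN'
        for (table, attribute), values in enum_index.items():
            for value in sorted(forms & values):
                rules.append(f"{table}.{attribute}='{value}'")
    return rules


def enum_query(sentence, table, enum_index):
    """Build a single-table query constrained by the ENUM rules that fired."""
    rules = enum_rules(sentence, enum_index)
    sql = f"SELECT {table}.* FROM {table}"
    return sql + (" WHERE " + " AND ".join(rules) if rules else "")


print(enum_query("List all romans", "BOOKS", ENUM_INDEX))
```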



3.4 Indexing the TEXT Type 

A compressed BLOB 3 relating (Table, Attribute) and the attribute values is created for
each TEXT attribute. Example: TABLE=AUTHORS, ATTRIBUTE=NAME,
TEXT={‘Mark Twain’, ‘William Shakespeare’, ‘Emma Lazarus’}. If any of the
indexed NAME values occurs in the input sentence, a new production rule is added to
the SQL relating AUTHORS.NAME to VALUE, as in the example below:

Sentence: 'Show information about Mark Twain'

'Mark', 'Twain' are values related to AUTHORS.NAME
The production rule is:

AUTHORS.NAME='Mark Twain'

The resulting SQL query is:

SELECT AUTHORS.* FROM AUTHORS WHERE AUTHORS.NAME='Mark Twain'



3.5 Building the SQL Production Rules 

The database schema is used to build a set of SQL query elements involving relations
to be joined. For example, the tables AUTHORS and BOOKS participate in an N-to-N
relationship with the table BOOKAUTHORS, as shown in the database schema given
in section 3.1.

The pre-processor builds the following production rule:

IF AUTHORS in Table List AND BOOKS in Table List Then BOOKAUTHORS is
in Table List

and the following SQL template:

SELECT Attribute List FROM Table List
WHERE BOOKAUTHORS.AID=AUTHORS.ID AND BOOKAUTHORS.BID=BOOKS.ID
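
The production rule and its SQL template can be represented as data and applied mechanically. In the sketch below the rule table, keyed by the pair of primary tables, and the function name are assumptions; only the rule content comes from the text.

```python
# One rule per N-to-N relationship: pair of primary tables -> (link table, join conditions).
JOIN_RULES = {
    frozenset({"AUTHORS", "BOOKS"}): (
        "BOOKAUTHORS",
        ["BOOKAUTHORS.AID=AUTHORS.ID", "BOOKAUTHORS.BID=BOOKS.ID"],
    ),
}


def apply_join_rules(table_list, rules=JOIN_RULES):
    """Add the link table and its join conditions when both primary tables are present."""
    tables = list(table_list)
    constraints = []
    for pair, (link_table, conditions) in rules.items():
        if pair <= set(table_list) and link_table not in tables:
            tables.append(link_table)
            constraints.extend(conditions)
    return tables, constraints


print(apply_join_rules(["AUTHORS", "BOOKS"]))
```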



3 BLOB Binary Large Object 
3.6 The Run Time Workflow

Let us consider Fig. 4, where the input sentence is ‘Show all books written by Mark
Twain’. Based on the dictionary and on the production rules, the run time engine
incrementally builds the attribute list, the table list and the SQL constraints as shown
above. The tokens from the input sentence are matched with elements in the semantic
sets or in the index files as shown below:



'Show' = no match 
'all' = no match 

'books' = matches one entry in the semantic set BOOKS 
'written' = no match 
'by' = no match 

'Mark' = matches one entry in the index file associated with
AUTHORS.NAME

'Twain' = matches one entry in the index file associated with
AUTHORS.NAME

The token ‘books’ is matched with table BOOKS. Because ‘Mark’ and ‘Twain’ are
neighbors in the input query and because they both point to the same value of the
AUTHORS.NAME attribute in that order, they are merged into ‘Mark Twain’ and the
corresponding SQL constraint becomes AUTHORS.NAME=‘Mark Twain’. Because
tables BOOKS and AUTHORS are referenced in the matches, they are both included in
the Attribute List. As shown in section 3.5, the tables AUTHORS and BOOKS form an
N-to-N relation along with table BOOKAUTHORS, so the three tables are included in
the Table List:

BOOKS, AUTHORS, BOOKAUTHORS (1)

and the first two SQL constraints are:

BOOKAUTHORS.AID=AUTHORS.ID AND BOOKAUTHORS.BID=BOOKS.ID (2)

An additional SQL constraint is:

AUTHORS.NAME='Mark Twain' (3)

Based on (1), (2) and (3) the resulting SQL query is:

SELECT AUTHORS.*, BOOKS.* FROM AUTHORS, BOOKS, BOOKAUTHORS
WHERE BOOKAUTHORS.AID=AUTHORS.ID AND BOOKAUTHORS.BID=BOOKS.ID
AND AUTHORS.NAME='Mark Twain'

In the model presented above all attributes of the primary tables AUTHORS and
BOOKS are in the SQL attribute list and the attributes of the table BOOKAUTHORS
are not.
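
The whole run time workflow for this example can be put together in a short sketch. Everything here is a simplification under stated assumptions: the dictionaries are abbreviated, the helper names are hypothetical, and neighbouring tokens are merged by looking for an indexed value as a contiguous word sequence rather than by matching 'Mark' and 'Twain' separately.

```python
SEMANTIC_SETS = {"BOOKS": {"book", "volume"}, "AUTHORS": {"writer", "author"}}
TEXT_INDEX = {("AUTHORS", "NAME"): ["Mark Twain", "William Shakespeare", "Emma Lazarus"]}
JOIN_RULES = {
    frozenset({"AUTHORS", "BOOKS"}): (
        "BOOKAUTHORS",
        ["BOOKAUTHORS.AID=AUTHORS.ID", "BOOKAUTHORS.BID=BOOKS.ID"],
    ),
}


def singular(word):
    """Crude plural stripping, assumed for the example."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def translate(sentence):
    tokens = sentence.split()
    primary, value_constraints = [], []
    # 1. Tokens that name a table through its semantic set.
    for token in tokens:
        for table, words in SEMANTIC_SETS.items():
            if singular(token.lower()) in words and table not in primary:
                primary.append(table)
    # 2. Indexed TEXT values that occur as neighbouring tokens, in order.
    for (table, attribute), values in TEXT_INDEX.items():
        for value in values:
            words = value.split()
            spans = range(len(tokens) - len(words) + 1)
            if any(tokens[i:i + len(words)] == words for i in spans):
                value_constraints.append(f"{table}.{attribute}='{value}'")
                if table not in primary:
                    primary.append(table)
    primary.sort()
    # 3. Production rules add the link table and the join conditions.
    tables, joins = list(primary), []
    for pair, (link_table, conditions) in JOIN_RULES.items():
        if pair <= set(primary):
            tables.append(link_table)
            joins.extend(conditions)
    # Only the primary tables contribute to the attribute list.
    sql = f"SELECT {', '.join(t + '.*' for t in primary)} FROM {', '.join(tables)}"
    where = joins + value_constraints
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql


print(translate("Show all books written by Mark Twain"))
```

As in the text, the link table BOOKAUTHORS appears in the table list but not in the attribute list.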
3.7 The Ambiguity Resolution for Twin Attribute Values

If two tables share the same attribute value, as in AUTHORS.NAME=‘Mark Twain’
and STUDENTS.NAME=‘Mark Twain’, then there is ambiguity as to whether BOOKS
should be related to AUTHORS or to STUDENTS. The ambiguity is solved at run
time. The token ‘written’ from the input sentence is processed through WordNet for
derivationally related forms. Here are the results:



Results for 'Derivationally related forms' search of verb 'written' 
9 senses of write 
Sense 1 

RELATED TO->(noun) writer#1

=> writer, author
RELATED TO->(noun) writing#1

=> writing, authorship, composition, penning



The dynamic semantic set associates ‘written’ to {writer, author, writing, authorship,
composition, penning}. Comparing the elements of this set to the preprocessed
semantic sets, it follows that ‘written’ is related to the AUTHORS set and not to the
STUDENTS set.
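
The comparison between the dynamic semantic set and the preprocessed sets amounts to a set intersection. In the sketch below the derivationally related forms are copied from the WordNet output above; the STUDENTS semantic set is an assumption, since the text does not list it, and the scoring by overlap size is only one possible reading of the comparison.

```python
SEMANTIC_SETS = {
    "AUTHORS": {"writer", "author", "generator", "source",
                "maker", "communicator", "shaper"},
    # Assumed contents: the STUDENTS set is not given in the text.
    "STUDENTS": {"student", "pupil", "educatee"},
}

# Derivationally related forms per token, as quoted above for 'written'.
RELATED_FORMS = {
    "written": {"writer", "author", "writing", "authorship",
                "composition", "penning"},
}


def disambiguate(candidate_tables, context_tokens):
    """Pick the candidate table whose semantic set overlaps most with the
    dynamic semantic sets of the context tokens; None if undecided."""
    scores = {table: 0 for table in candidate_tables}
    for token in context_tokens:
        dynamic_set = RELATED_FORMS.get(token.lower(), set())
        for table in candidate_tables:
            scores[table] += len(dynamic_set & SEMANTIC_SETS[table])
    best = max(scores.values())
    winners = [t for t in candidate_tables if scores[t] == best]
    return winners[0] if best > 0 and len(winners) == 1 else None


print(disambiguate(["AUTHORS", "STUDENTS"], ["Show", "all", "books", "written", "by"]))
```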

3.8 The Context Ambiguity Resolution 

Early work [8] showed how a number of different knowledge sources are necessary to 
perform automatic disambiguation. Others [7] restricted the input language and the
syntactic constructions that were allowed. Yet another approach is to introduce a
stochastic context definition as presented in [9]. The present paper uses the database
schema and WordNet to address the context ambiguity.

Let us consider the tables described in Section 3.1 and the new table introduced be- 
low: 

Table BORROWEDBOOKS (LONG bid references BOOKS, LONG sid references 
STUDENTS ) 

Let the input sentence be ‘Who borrowed Romeo and Juliet?’ Following the 
workflow shown in section 3.6, the analyzer returns only this match: 



BOOKS.NAME='Romeo and Juliet'



The table list includes only BOOKS and the resulting SQL query is:

SELECT BOOKS.* FROM BOOKS
WHERE BOOKS.NAME='Romeo and Juliet'



The returned result set shows all attribute values for the selected book; however, it
does not show who borrowed the book, if anyone did. There are two approaches
for fixing the ambiguity: (a) all tables and attributes in the database schema are given
meaningful names, and (b) the table BORROWEDBOOKS receives the semantic set of
the token BORROWED. Both options require manual operations before or during the
pre-processing time. The run time results are:



'borrowed' is matched with the semantic set for BORROWEDBOOKS
BORROWEDBOOKS table involves BOOKS and STUDENTS tables
BOOKS and STUDENTS are in N-to-N relationship represented by the table
BORROWEDBOOKS

The Table List becomes BOOKS, STUDENTS, BORROWEDBOOKS
From the database schema the first SQL constraint is
BORROWEDBOOKS.BID=BOOKS.ID AND BORROWEDBOOKS.SID=STUDENTS.ID
From the TEXT index matching the second SQL constraint is
BOOKS.NAME='Romeo and Juliet'

The attribute list is BOOKS.*, STUDENTS.*

The final SQL query is:

SELECT BOOKS.*, STUDENTS.* FROM BOOKS, STUDENTS, BORROWEDBOOKS WHERE
BORROWEDBOOKS.BID=BOOKS.ID AND BORROWEDBOOKS.SID=STUDENTS.ID AND
BOOKS.NAME='Romeo and Juliet'



The result set shows the required information, if any. 



4 Capabilities and Limitations 



The proposed method accepts complex queries. It can disambiguate the context if
there are enough elements in the input that are successfully matched, as in the
example shown below:

'Show all romans written by Mark Twain and William Shakespeare that
have been borrowed by John Markus'

The method retains the following tokens:

'romans ... Mark Twain ... William Shakespeare ...
borrowed ... John Markus'

The token ‘romans’ points to table BOOKS: ‘romans’ is found in the INDEX files for
ENUM values of the attribute BOOKS.TYPE. Following the workflows presented in
3.6 and 3.8, the method findings are:

'romans' matches the ENUM value BOOKS.TYPE='ROMAN'
'Mark Twain' matches AUTHORS.NAME='Mark Twain'
'William Shakespeare' matches AUTHORS.NAME='William Shakespeare'
'borrowed' is disambiguated at run time through BORROWEDBOOKS
'John Markus' matches STUDENTS.NAME='John Markus'
Schema correlates AUTHORS, BOOKS and BOOKAUTHORS
Schema correlates STUDENTS, BOOKS and BORROWEDBOOKS
The table list becomes: AUTHORS, BOOKS, BOOKAUTHORS, STUDENTS,
BORROWEDBOOKS
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The SQL Constraints are: 

BOOKAUTHORS . AID-AUTHORS . ID AND BOOKAUTHORS . BID=BOOKS . ID 

AND BOOKS . TYPE= ' ROMAN ' 

AND ( AUTHORS. NAME= ' Mark Twain' OR AUTHORS .NAME= 'William Shakespeare') 

AND STUDENTS . NAME= ' John Markus' 

AND BOOKS . ID=BORROWEDBOOKS . BID AND STUDENT . ID=BORROWEDBOOKS . SID 

The two constraints on AUTHORS.NAME have been OR-ed because they point to
the same attribute. The method thus constructs the correct SQL query. As
already shown in 3.7 and 3.8, the method can address context and value
ambiguities.
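
The OR-ing rule can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name build_where and the representation of constraints as (attribute, value) pairs are assumptions, and a production system would pass values as query parameters rather than splice them into the SQL text.

```python
from collections import OrderedDict

def build_where(join_constraints, value_constraints):
    """AND all constraints together; OR the values that target the same attribute."""
    by_attr = OrderedDict()
    for attr, value in value_constraints:
        by_attr.setdefault(attr, []).append(value)

    parts = list(join_constraints)
    for attr, values in by_attr.items():
        # Double any single quote so the literal stays well-formed.
        tests = ["{}='{}'".format(attr, v.replace("'", "''")) for v in values]
        parts.append(tests[0] if len(tests) == 1 else "(" + " OR ".join(tests) + ")")
    return " AND ".join(parts)

joins = [
    "BOOKAUTHORS.AID=AUTHORS.ID",
    "BOOKAUTHORS.BID=BOOKS.ID",
    "BOOKS.ID=BORROWEDBOOKS.BID",
    "STUDENTS.ID=BORROWEDBOOKS.SID",
]
values = [
    ("BOOKS.TYPE", "ROMAN"),
    ("AUTHORS.NAME", "Mark Twain"),
    ("AUTHORS.NAME", "William Shakespeare"),
    ("STUDENTS.NAME", "John Markus"),
]
where = build_where(joins, values)
print(where)
```

Grouping by attribute before joining keeps the single-valued constraints (BOOKS.TYPE, STUDENTS.NAME) as plain AND terms and wraps only the repeated attribute in a parenthesized OR group.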

The current architecture, however, does not support operators such as greater than,
less than, count, average and sum. It does not resolve date expressions such as
before, after and between. The generated SQL does not support nested queries. The
proposed method eliminates all tokens that cannot be matched with either the
semantic sets or the index files, and it works for semantically stable databases.
The preprocessor must be run after each semantic update of the database in order
to update the index files. Context disambiguation is limited to the semantic sets
related to a given schema. Errors introduced by tokenizing, by WordNet and by human
intervention propagate into the SQL query. The method completely disregards
unmatched tokens and thus cannot correct an input query that contains errors.
However, the method correctly interprets the tokens that are found in the semantic
sets or among the derivationally related terms at run time.



5 Future Work 

Future work will focus on the operator resolution listed in Section 4. We
believe that the approach presented in this paper can give good results with a
minimum of implementation effort and that it avoids specific problems related to
the various existing semantic analysis approaches. This is partly made possible by
the highly organized data in the RDBMS. The method will be implemented and the
results will be measured against complex sentences involving more than 4 tables
from the database. A study will be done to show how performance depends on the
size of the database records and on the database schema.


Abstract. In this paper we describe an interactive tool called the Avaya
Interactive Dashboard, or AID, that was designed to help Avaya’s ser- 
vices organization to mine the text fields in Maestro. AID allows engi- 
neers to quickly and conveniently drill down to discover patterns and/or 
verify intuitions with a few simple clicks of the mouse. The interface has 
a web-based front-end that interacts through CGI scripts with a central 
server implemented mostly in Java. The central server in turn interacts 
with Maestro as needed. 



1 Introduction 

The Avaya problem database (Maestro) consists of approximately 5 million 
records. Each record in the database corresponds to a problem in one of Avaya’s 
deployed products and is populated either by the product itself (called alarms) 
via a self-diagnostic reporting mechanism, or by a service engineer via a problem 
reporting phone call. In the latter case, a human operator listens to the problem 
described by the engineer and creates the database record (called tickets) manu- 
ally. Presently, the Maestro database consists of approximately 4 million alarms 
and a million tickets. 

Each record in Maestro consists of several structured fields (name of cus- 
tomer, location, date of problem, type of product, etc.), and at least three un- 
structured text fields: problem description, one or more notes on the progress, 
and problem resolution description. The unstructured fields are restricted to 256
bytes, and this constraint forces the operators to summarize and abbreviate
liberally. As the problem is worked on by one or more engineers, the notes fields
are updated accordingly. Upon resolution, the resolution field is updated. All
updates to the text fields occur via the phone operators. The heavy use of
summarization and abbreviations makes the data contained in these fields
“dirty.” In other words, there are numerous typos, inconsistent abbreviations,
non-standard acronyms, etc. in these fields.

The database is mined regularly by Avaya’s researchers and services engineers
for discovering various patterns such as: frequency of alarms and tickets, time 
of occurrence, number of alarms or tickets per product or customer, etc. Thus 
far, only the structured fields were mined. However, there was a great need for 
mining the unstructured text fields as they contain rich information like the type 
of problem, the solution, etc. In fact, the unstructured data fields were being 
mined manually by a set of service engineers by first using database queries 
on the structured fields (to restrict the number of tickets returned) and then 
by manually eyeballing the resulting tickets to discover patterns like types of
problems, commonly occurring problems by customer/location, and commonly 
occurring problems across customers/locations for a particular product, etc. 

In this paper we describe an interactive tool called the Avaya Interactive 
Dashboard, or AID, that was designed to help Avaya’s services organization to 
mine the text fields in Maestro. AID allows engineers to quickly and conveniently 
drill down to discover patterns and/or verify intuitions with a few simple clicks of 
the mouse. The interface has a web-based front-end that interacts through CGI 
scripts with a central server implemented mostly in Java. The central server in 
turn interacts with Maestro as needed. 

2 The Design of AID 

AID provides search utilities on a large-scale relational database, with a focus
on text analysis and mining. As mentioned earlier, through automatic text analysis,
AID allows service engineers to quickly and conveniently discover patterns and
verify intuitions about problems. The design of this application meets several
objectives. In this section, we first briefly discuss these design objectives and
then discuss in detail the issues related to them, including functionalities,
algorithms and the application model.

2.1 Design Goals 

AID emphasizes three design goals in order to provide advanced service: usability,
precision and performance. High usability is a fundamental requirement for
service-oriented applications. It has two underlying meanings. First, the services
must be easy to use: the application interface should be simple and should run
on a highly available platform such as the Internet. Second, the application should
provide an array of functions to support different tasks. In general, search and
mining utilities can simplify the tasks of automatically identifying relevant
information and of providing more specific information. Our design of
functionalities is based on these two tasks, with several additional features.
Precision is a standard evaluation metric for search utilities and text analysis.
The algorithms we use in our approach must achieve high precision to help users
directly target the problems. The TFIDF model from information retrieval is used
for text similarity analysis, and hierarchical clustering is used to group results.
Regarding the performance requirement, we want our application to provide quick
and highly available services. AID is a centralized server application. The server
must respond to queries quickly in order to remain highly available when requests
accumulate.

2.2 Functionalities 

AID has two major functionalities: searching relevant tickets using keywords 
or sample tickets, and clustering a set of tickets into groups to identify more 
specific information. Keyword and sample ticket search is based on the text 
fields of tickets in the relational database. It can be constrained by non-text fields 
of tickets. Clustering is a standard method in data mining that can readily be
used on a relational database. Our approach is specific in that it performs
clustering only on text fields. Furthermore, two additional functions are provided:
formatted retrieval of a single ticket and categorization of tickets by non-text
fields. These functionalities are well linked, and each can be performed on the
results of another. For example, clustering can be performed on the results given
by a keyword search, and categorization can be applied to each of the groups in
the clustered results. By combining a sequence of operations, such interactive
functions give users rich and flexible ways to identify the problems that are
specific to them.

2.3 Algorithms 

We use two well-studied algorithms in information retrieval and mining, TFIDF 
and hierarchical clustering, as part of the text analysis modules in AID. 



TFIDF. This relevance algorithm was originally proposed by Rocchio [4] for the
vector space retrieval model [5]. In the vector space model, a text document is
viewed as a bag of terms (or words) and represented by a vector in the vocabulary
space. The basic TFIDF algorithm computes the relevance score of two text
documents by summing the products of the weights of each term that appears in
both documents, normalized by the product of the Euclidean vector lengths of the
two documents. The weight of a term is a function of the term's occurrence
frequency in the document (the term frequency (TF)) and of the number of documents
in the collection that contain the term (from which the inverse document
frequency (IDF) is derived). Mathematically, let $N$ be the number of text
documents in the collection and let $x_{ij}$ indicate whether term $i$ occurs in
document $j$ ($x_{ij} = 1$ if term $i$ occurs in document $j$, and 0 otherwise).
The inverse document frequency of term $i$ is defined as

\[ IDF_i = \log\left( \frac{N}{\sum_{j=1}^{N} x_{ij}} \right). \qquad (1) \]

The relevance score of two text documents is given as
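
In words, the score described above is a cosine similarity over TFIDF-weighted term vectors. The following Python sketch illustrates the computation. It assumes that the weight of a term is its raw count in the document multiplied by the IDF defined above, which is one common choice and not necessarily the weighting used in AID; the function names and the toy collection are illustrative only.

```python
import math
from collections import Counter

def idf(term, docs):
    """log(N / number of documents that contain the term); 0 if the term is unseen."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def weights(doc, docs):
    """Assumed weighting: term count in the document times IDF."""
    tf = Counter(doc)
    return {t: tf[t] * idf(t, docs) for t in tf}

def relevance(doc_a, doc_b, docs):
    """Sum of weight products over shared terms, divided by both vector lengths."""
    wa, wb = weights(doc_a, docs), weights(doc_b, docs)
    dot = sum(wa[t] * wb[t] for t in wa.keys() & wb.keys())
    norm = (math.sqrt(sum(v * v for v in wa.values()))
            * math.sqrt(sum(v * v for v in wb.values())))
    return dot / norm if norm else 0.0

# Toy collection of already tokenized tickets (invented).
docs = [
    ["paging", "system", "toos"],
    ["paging", "music", "out"],
    ["power", "failure"],
]
same = relevance(docs[0], docs[0], docs)
related = relevance(docs[0], docs[1], docs)
unrelated = relevance(docs[0], docs[2], docs)
print(same, related, unrelated)
```

A document compared with itself scores 1 (up to rounding), documents sharing a term score between 0 and 1, and documents with no common term score 0.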
Hierarchical clustering based on text similarity. Several different
clustering algorithms have been studied intensively [2]. Hierarchical
clustering is a bottom-up approach based on the distance between each pair of
elements at the beginning and on the distance between each pair of clusters in
the intermediate steps. Since clustering is also performed on text fields, we use
TFIDF as the similarity measure to compute distance. The running time of
hierarchical clustering is on the order of the square of the number of elements.
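
The bottom-up procedure can be sketched as follows. The sketch makes several assumptions that the text does not fix: clusters are compared by average pairwise similarity, merging stops when the best pair falls below a threshold, and a plain cosine on raw term counts stands in for the TFIDF score so that the example is self-contained.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity on raw term counts (stand-in for the TFIDF score)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Start from singletons and repeatedly merge the most similar pair of clusters."""
    clusters = [[i] for i in range(len(docs))]

    def similarity(c1, c2):
        # Average linkage (an assumption): mean similarity over all cross pairs.
        total = sum(cosine(docs[i], docs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = similarity(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break                      # remaining clusters are too dissimilar
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

tickets = [
    ["no", "overhead", "music"],
    ["music", "intermittent"],
    ["power", "supply", "failed"],
    ["power", "failed"],
]
groups = cluster(tickets)
print(groups)
```

On this toy input the two music tickets and the two power tickets end up in separate groups, mirroring the problem types discussed in the example section.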

2.4 Application Model 

AID is a centralized server application, which may be used by multiple users
simultaneously. It is authorized to read data from the relational database but not
to write to it. The server is stateless across requests: it does not cache any
history of the operations performed by a given user. If a task requires a stateful
transition, the client must provide complete information about the whole procedure
so that the server can determine which action to perform. This application model
is very similar to that of a simple web server. Many implementation and
optimization techniques for building web servers can be safely used here to
enhance performance. The main difference from a web server is that the
backend is a database, not a file system. The performance of the server may be
limited by the capacity of database query processing and by the data transfer rate.

3 Implementation 

AID is a 3-tier server application: an independent interface, a multifunctional
server and a backend database. We developed a web interface for AID to facilitate
use of the AID search service. A CGI program accepts requests from the web
and reformats them to be compatible with the interface of the AID server. The AID
server is the central service provider, which organizes the internal logic and
performs the appropriate analysis. The backend database is the read-only data
source for AID. The main code of AID is written in Java, with a small amount of C.
Since our aim is not to develop a search utility serving a very large population,
Java performance is adequate for our needs.

3.1 The Architecture of AID Server 

In the basic scenario, the AID server accepts and processes incoming requests,
retrieves data from the database, performs text analysis and sends out results in
HTML format. Figure 1 shows the server components from the point of view of data
flow. The server socket module maintains a pool of working threads.

Each incoming request is assigned a working thread and some necessary resources.
The request is then forwarded to the query manager, which identifies
which functionality is requested and pre-processes the request parameters before
database retrieval. Different functionalities require different parameters. In
searching relevant tickets, the search can be constrained by several non-text
fields, such as product name/ID, customer name/ID, date and severity level. In
clustering a set of tickets, a number of ticket identifiers have to be provided.
These parameters are corrected and formatted to be compatible with database
formats. The database module is responsible for querying and retrieving data from
the database. Since we allow simultaneous processing of multiple requests, we have
implemented a high-level SQL query processor which can handle multiple queries in
parallel. Internally, we maintain only a few connections to the database but
multiple JDBC statements to handle queries. This approach has several good features:

- Good utilization of the network bandwidth between the AID server and the
  database. Since the AID server is the only agent that communicates with the
  database when multiple users are using the AID service, a few connections are
  sufficient to transfer data over the network.

- It avoids exhausting the number of database connections. If each working thread
  processing a request occupied a connection to the database, the database
  connections would be overburdened and the overhead of initializing many
  connections would be significant.

- Queries can be executed simultaneously using multiple JDBC statements
  over a few connections. The database schedules data transfer based on
  availability, so on a given connection the data for any query will not be
  blocked even if the data of previous queries is not ready.
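
The idea of serving many worker threads over a few database connections can be illustrated in Python. Note the substitution: Python's sqlite3 module has no counterpart to several concurrent JDBC statements on one connection, so this sketch uses a small pool from which each worker borrows a connection for the duration of one query. The table, its columns, the pool size and the data are all invented for the example.

```python
import os
import queue
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 2   # "a few connections" (value chosen for the sketch)
WORKERS = 8     # more worker threads than connections

def make_db(path):
    """Create a toy TICKETS table; columns and rows are invented."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE TICKETS (ID INTEGER, SEVERITY INTEGER)")
    conn.executemany("INSERT INTO TICKETS VALUES (?, ?)",
                     [(i, i % 4 + 1) for i in range(100)])
    conn.commit()
    conn.close()

def run_queries(path, severities):
    """Answer many requests in parallel over a small, shared set of connections."""
    pool = queue.Queue()
    for _ in range(POOL_SIZE):
        # check_same_thread=False lets a connection be used by whichever
        # worker currently holds it; the queue ensures one holder at a time.
        pool.put(sqlite3.connect(path, check_same_thread=False))

    def count(severity):
        conn = pool.get()              # blocks until a connection is free
        try:
            cur = conn.execute(
                "SELECT COUNT(*) FROM TICKETS WHERE SEVERITY = ?", (severity,))
            return cur.fetchone()[0]
        finally:
            pool.put(conn)             # hand the connection back

    with ThreadPoolExecutor(max_workers=WORKERS) as executor:
        results = list(executor.map(count, severities))
    while not pool.empty():
        pool.get().close()
    return results

with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "tickets.db")
    make_db(db_path)
    counts = run_queries(db_path, [1, 2, 3, 4] * 4)
print(counts)
```

Sixteen requests are answered by eight workers over two connections; no request ever opens a connection of its own, which is the property the list above argues for.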

The text analysis module is the core component of our application. Many text
processing techniques are used here to provide a high-quality search service.
We discuss it in detail in the next subsection. When results are ready, the
response module outputs HTML in a unified format in order to present results
to users with a consistent look and feel.




Fig. 1. AID server.

3.2 The Architecture of Text Analysis Module

Most of the functionalities provided by AID rely on text analysis. Figure 2
presents the architecture of the text analysis module in the AID server. All
original text data, both the text data retrieved from the database and the given
keywords, must be cleaned and translated by the text filter module. The module
removes common stop words and performs a dictionary lookup to expand the most
commonly used of the many abbreviations employed by the technicians who write
the text data. For example, “TOOS” is the abbreviation of “Totally Out Of
Service.” When computing relevant tickets, the keywords/sample ticket and the
retrieved tickets are passed to the Relevance Evaluator to compute a relevance
score. For clustering or categorizing, however, the filtered data goes directly
into the Clustering module or the Categorizing module. Both the Relevance
Evaluator and the Clustering module make use of the TFIDF module, which computes
the similarity of two pieces of text. Relevance searching requires a 1-to-N
similarity computation, while hierarchical clustering is an N-to-N computation.
We therefore provide two different interfaces in the TFIDF module. Once the
Relevance Evaluator computes a score, it forwards the result immediately to the
Output Manager. The Output Manager module maintains an array of tickets of limited
size. Only the tickets with the top relevance scores are kept in the array. In this
way, we cache only a small portion of the retrieved data, for memory efficiency.
Finally, the results are sorted for output.
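
Two of the components described above lend themselves to a short sketch: the text filter and the bounded Output Manager. The Python code below is an illustration under stated assumptions. The stop-word list is invented, the abbreviation dictionary contains only the TOOS example from the text, and a min-heap is one possible way to realize the "array of limited size"; the class and function names are hypothetical.

```python
import heapq

STOP_WORDS = {"the", "is", "a", "of", "and", "to"}        # assumed list
ABBREVIATIONS = {"toos": "totally out of service"}         # example from the text

def text_filter(text):
    """Lower-case the text, expand known abbreviations, then drop stop words."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(ABBREVIATIONS.get(word, word).split())
    return [t for t in tokens if t not in STOP_WORDS]

class OutputManager:
    """Keep only the `limit` best-scoring tickets seen so far."""

    def __init__(self, limit):
        self.limit = limit
        self._heap = []                # min-heap of (score, ticket_id)

    def offer(self, score, ticket_id):
        if len(self._heap) < self.limit:
            heapq.heappush(self._heap, (score, ticket_id))
        elif score > self._heap[0][0]:
            # Evict the current lowest score in favor of the new ticket.
            heapq.heapreplace(self._heap, (score, ticket_id))

    def results(self):
        """Final step: sort the retained tickets, best first."""
        return sorted(self._heap, reverse=True)

filtered = text_filter("Paging system is TOOS")

manager = OutputManager(limit=2)
for score, ticket in [(0.2, "A"), (0.9, "B"), (0.5, "C"), (0.1, "D")]:
    manager.offer(score, ticket)
top = manager.results()
print(filtered, top)
```

Because each score is offered as soon as it is computed, memory use is bounded by the limit regardless of how many tickets the database returns.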

[Figure 2: block diagram. Recoverable labels: Structured data fields, Database module, Database, Response Module. Legend: Data module, Functional module.]

Fig. 2. The Text Analysis Module in AID server.

It is worthwhile discussing the effect of document frequency on the computation
of the relevance score in the TFIDF algorithm. Online computation of document
frequencies is very expensive because it requires a complete scan of the database.
Instead, we scan the database offline to collect them and update them regularly
while the AID server is running. The vocabulary of the collection contains 10,708
different terms, which are cached in memory. As lookup is the only operation
on document frequencies, the data is stored in a hash table for optimal lookup
efficiency. AID can be configured to search in a subset of the database.
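
The offline pass can be pictured as a single scan that fills a hash table keyed by term. In the Python sketch below a dict plays the role of the hash table; the function name and the two sample documents are assumptions made for illustration.

```python
from collections import Counter

def document_frequencies(documents):
    """Offline pass: for each term, count the documents that contain it."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))     # a document counts a term at most once
    return dict(df)

# Two already tokenized tickets (invented).
df_table = document_frequencies([
    ["paging", "music"],
    ["paging", "power", "power"],
])

# At query time only constant-time lookups are needed.
print(df_table["paging"], df_table.get("speaker", 0))
```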

3.3 Multi-thread Model for Server Development 

Our read-only server is similar to a simple web server whose
backend data source is a database. Pai [3] discusses the properties of
several architectures for building a portal web server and shows that the overhead
of the multi-thread model mostly comes from asynchronous disk access in the file
system and from the scheduling of threads in the operating system. The AID server
makes little use of the local file system after startup, so it does not incur as
much overhead as multi-thread web servers. Furthermore, we use a thread pool to
maintain the working threads, so that when requests accumulate the AID server does
not create many new threads and the overhead of scheduling threads is greatly
reduced.
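
The effect of a fixed thread pool can be observed with a small Python experiment. The pool size of 4, the thread name prefix and the request handler are placeholders chosen for this sketch; AID itself is written in Java.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def handle(request_id):
    """Stand-in for request processing; reports which worker thread served it."""
    return request_id, threading.current_thread().name

# A burst of 50 requests is served by at most 4 reusable worker threads.
with ThreadPoolExecutor(max_workers=4, thread_name_prefix="aid-worker") as pool:
    served = list(pool.map(handle, range(50)))

workers = {name for _, name in served}
print(len(served), "requests served by", len(workers), "threads")
```

However large the burst, the number of distinct worker threads never exceeds the pool size, so thread creation and scheduling costs stay bounded.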

4 Example 

The usefulness of AID is best demonstrated with a detailed example. Figure 3
shows how AID can help mine the text fields of Maestro. The first
box in the figure shows the sample ticket that is used to search the database
(for similar tickets). The sample ticket describes a platinum customer whose
paging system is totally out of service. There is no overhead music. The ticket
is assigned a severity value of 4 (the highest possible) and an engineer has been
dispatched to look at the problem. The second box shows the top 3 tickets that
were returned using AID’s search capabilities. All these tickets show a platinum
customer having trouble with their paging system (totally out of service with no
overhead music).

Fig. 3. Example of the use of AID

Using AID’s clustering algorithms on the set of tickets returned as a result 
of the search shows a trend in the types of problems occurring in the paging 
systems. There are three main types of paging system problems: music, power, 
and speakers (box 3). 

A manual analysis of the tickets returned by the search routines shows that
a large number of tickets report problems with a platinum customer (indicating
a top customer). AID allows the user to categorize the tickets by customer,
location, product, etc. Using the categorization function of AID by customer
shows that the platinum customers referred to in all the tickets in box 2 are the
same. If the user wants to dig deeper and analyze which locations are affected,
he/she can re-categorize the results in box 2 by location. Moreover, the user has
the option of using the categorization feature to verify whether there is a
recurrence of a particular problem type (as shown in box 3) at a particular location.
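
Categorization by a structured field amounts to grouping ticket identifiers by the value of that field. A minimal Python sketch follows; the field names (id, customer, location) and the ticket records are hypothetical and do not reflect Maestro's actual schema.

```python
from collections import defaultdict

def categorize(tickets, field):
    """Group ticket ids by the value of one structured (non-text) field."""
    groups = defaultdict(list)
    for ticket in tickets:
        groups[ticket[field]].append(ticket["id"])
    return dict(groups)

# Invented search results.
results = [
    {"id": 101, "customer": "ABC Corporation", "location": "Denver"},
    {"id": 102, "customer": "ABC Corporation", "location": "Austin"},
    {"id": 103, "customer": "ABC Corporation", "location": "Denver"},
]
by_customer = categorize(results, "customer")
by_location = categorize(results, "location")
print(by_customer, by_location)
```

Calling the same function with a different field name corresponds to re-categorizing the results, for instance by location instead of customer.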

Suppose, armed with the additional knowledge that platinum customer ABC 
Corporation is experiencing problems with their paging systems, the user wants 
to investigate further. The investigation now can take several possible routes: 

1. The user may want to narrow the search to only ABC Corporation and search 
for all paging system problems there. This is simply done by returning to 
box 1 and by placing a restriction on the customer (in addition to providing 
the sample ticket). 

2. The user may want to assess the timelines of the problems, both at
ABC Corporation and also for paging systems in general. This is done by 
returning to box 1 and placing a restriction on date ranges, product types, 
and/or customers. 

3. The user may want to discover if, for any of these cases, there was any
prior indication that a complete outage (severity 4) was going to occur. The
indications may have been prior tickets with lower severity codes (such as 
music being intermittent, etc.) that were called in. Once again, the user can 
return to box 1 and search for tickets by placing a restriction on the severity 
code. 
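
All three routes place restrictions on structured fields before the text search is run. The Python sketch below shows one way such restrictions could be expressed; the function restrict, the field names and the convention that a tuple denotes an inclusive range are assumptions made for illustration.

```python
def restrict(tickets, **constraints):
    """Keep the tickets whose structured fields satisfy every constraint.

    A constraint is either a single required value or a (low, high) tuple
    that is treated as an inclusive range.
    """
    def satisfies(ticket):
        for field, wanted in constraints.items():
            value = ticket[field]
            if isinstance(wanted, tuple):
                low, high = wanted
                if not (low <= value <= high):
                    return False
            elif value != wanted:
                return False
        return True

    return [t for t in tickets if satisfies(t)]

# Invented tickets; dates are ISO strings so that string comparison orders them.
tickets = [
    {"id": 1, "customer": "ABC Corporation", "severity": 4, "date": "2003-05-02"},
    {"id": 2, "customer": "ABC Corporation", "severity": 2, "date": "2003-04-20"},
    {"id": 3, "customer": "XYZ Inc", "severity": 2, "date": "2003-04-25"},
]

# Route 3: earlier, lower-severity tickets for the same customer.
warnings = restrict(tickets, customer="ABC Corporation", severity=(1, 3),
                    date=("2003-01-01", "2003-05-01"))
print([t["id"] for t in warnings])
```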

5 Evaluation 

Given the fact that Maestro contains approximately one million tickets, a com- 
prehensive evaluation of the interface is not possible. However, we have solicited 
comments from a few users of the interface and are in the process of incorporat- 
ing changes based upon their feedback. The main benefit of using AID has been 
the increased productivity as the tool helps the user to quickly and conveniently 
drill down to the desired level. The slowest part of the system, the clustering 
module, takes under a minute to cluster a few hundred tickets.

6 Conclusion

In this paper we have described an interactive tool called Avaya Interactive 
Dashboard, or AID, that allows for quick and convenient mining of Avaya’s 
problem ticket database. The web-based front end of the system interacts with 
the user and formats the users’ queries for the central server which in turn 
interacts with the problem ticket database. The interface has greatly improved
the quality and the productivity of mining the unstructured text fields.


Abstract. In recent literature it is commonly agreed that the first phase
of the software development process is still an area of concern. Furthermore,
while software technology has changed and improved rapidly,
the ways of working in and managing this process have lagged behind.

This paper focuses on the process of information modeling, its quality
and the required competencies of its participants (domain experts and
system analysts). The competencies are discussed and motivated under the
assumption that natural language is the main communication vehicle between domain
expert and system analyst. As a result, these competencies provide the
key to the effectiveness of the process of information modeling.



1 Introduction 

Nowadays many methods exist for the software development process. Examples
are: Iterative development ([1], [2]), Evolutionary development,
Incremental development ([3]), Reuse-oriented development ([4]) and Formal
systems development ([5]).

As different as all these development processes may be, there are fundamen- 
tal activities common to all. One of these activities is requirements engineering 
(RE), although this activity has its own rules in each development method. RE is 
the process of discovering the purpose for which the software system is meant, by 
identifying stakeholders and their needs, and documenting these in a form that 
is amenable to analysis, communication, negotiation, decision-making (see [6]) 
and subsequent implementation. For an extensive overview of the field of RE we
refer to [7].

Experts in the area of software engineering agree that RE is the most
important factor for the success of the ultimate solution, because this phase
closes the gap between the concrete and the abstract way of viewing phenomena
in application domains ([8], [9]). As a consequence, during the RE process, the
information objects involved from the Universe of Discourse (UoD) have to be
identified and described formally. We will refer to this process as Information
Modeling (IM). The resulting model will serve as the common base for
understanding and communication while engineering the requirements.

Where different areas of expertise meet, natural language may be seen as
the base mechanism for communication. It is for this reason that each general 
modeling technique should support this basis for communication to some extent. 
As a consequence, the quality of the modeling process is bounded by the qual- 
ity of concretizing into an informal description augmented with the quality of 
abstracting from this description. 

This paper focuses on information modeling as an exchange process between
a domain expert and a system analyst. Our intention is to describe this
exchange process, and the underlying assumptions on its participants. Using 
natural language in a formalized way can, for example, be seen as a supplement 
to the concept of use cases (see [10]). 

Roughly speaking, a domain expert can be characterized as someone with (1) 
superior detail-knowledge of the UoD but often (2) minor powers of abstraction 
from that same UoD. The characterization of a system analyst is the direct 
opposite. We will describe the required skills of both system analysts and domain 
experts from this strict dichotomy and pay attention to the areas where they 
(should) meet. Of course, in practice this separation is less strict. Note that as a 
result of the interaction during the modeling process the participants will learn 
from each other. The system analyst will become more or less a domain expert, 
while the domain expert will develop a more abstract view on the UoD in terms 
of the concepts of the modeling technique. This learning process has a positive 
influence on effectiveness and efficiency of the modeling process, both qualitative 
(in terms of the result) and quantitative (in terms of completion time). 

According to [11] the IM process is refined into four more detailed phases. 
Note that this process need not be completed before other phases of
system development, i.e. design and implementation, can start. At each moment 
during this process the formal specification may be realized. Still the IM process 
may be recognized in the various methods for system development as discussed 
in the beginning of this section. 

In order to initiate the information modeling process the system analyst must 
elicit an initial problem specification from domain experts. This is referred to as 
requirements elicitation. In this phase domain knowledge and user requirements 
are gathered in interactive sessions with domain experts and system analysts. 
Besides traditional techniques, more enhanced techniques may be applied, e.g. 
cognitive or contextual techniques. For more elicitation techniques, see [7] or [12]. 

The requirements elicitation results in an informal specification, also referred 
to as the requirements document. As natural language is humans’ essential vehicle
to convey ideas, this requirements document is written in natural language. In 
case of an evolutionary development, the previous requirements document will 
be used as a starting point. 

In an iterative process of modeling, verification and validation the informal 
specification evolves to a complete formal specification, also referred to as a 
conceptual model. The primary task of the system analyst is to map the sentences 
of this informal specification onto concepts of the particular conceptual modeling 
technique used. As a by-product, a sample population of the concepts derived
from the example instances may be obtained. Using the formal syntax rules
of the underlying modeling technique, the formal specification can be verified. 
The conceptual model in turn can be translated into a comprehensible format. 
For some purposes a prototype is the preferable format, for other purposes a 
description is better suited. In this paper we restrict ourselves to a description 
in terms of natural language sentences that is to be validated by the domain 
expert. The example population serves as a source when instantiated sentences 
are to be constructed, thereby creating a feedback loop in the IM process. This 
translation process is called paraphrasing. 
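
Paraphrasing can be pictured as instantiating sentence patterns with the sample population. The Python sketch below is only an illustration of that idea: representing a fact type as a sentence template, the fact type "employment" and the sample instances are all invented and are not taken from any particular modeling technique.

```python
def paraphrase(fact_types, population):
    """Instantiate each fact type's sentence template with its sample population."""
    sentences = []
    for name, template in fact_types.items():
        for instance in population.get(name, []):
            sentences.append(template.format(**instance))
    return sentences

# Hypothetical conceptual model: one fact type with a verbalization template.
fact_types = {
    "employment": "Employee {employee} works for department {department}.",
}
# Hypothetical sample population derived from example instances.
population = {
    "employment": [
        {"employee": "Jones", "department": "Sales"},
        {"employee": "Smith", "department": "Research"},
    ],
}
sentences = paraphrase(fact_types, population)
for sentence in sentences:
    print(sentence)
```

The generated sentences are what a domain expert would be asked to validate; a rejected sentence points back to the fact type that produced it.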

Basically, the conceptual model may be seen as a generative device (grammar)
capable of generating not only the informal specification, but also all other
feasible states of the UoD.

The correctness of this way of working depends on whether the formal spec- 
ification is a proper derivate of the informal specification which in its turn must 
be a true reflection of the UoD. Being a proper derivate is also referred to as 
the falsification principle, which states that the derived formal specification is 
deemed to be correct as long as it does not conflict with the informal specifi- 
cation. Being a true reflection is referred to as the completeness principle. It is 
falsified when a possible state of the UoD is not captured, or when the grammar 
can derive an unintended description of a UoD state. 

These two principles require the participants of the modeling process to have 
some specific competencies. For instance, a domain expert should be able to 
come up with significant sample information objects. On the other hand a system 
analyst should be able to make a general model out of the sample information 
objects such that the model describes all other samples. Section 3 discusses these 
competencies in more detail. 

The effectiveness of this way of working depends on how well its participants 
can accomplish their share, i.e. (1) how well can a domain expert provide a 
domain description, (2) how well can a domain expert validate a paraphrased 
description, (3) how well can a system analyst map sentences onto concepts, and 
(4) how well can a system analyst evaluate a validation. At least one iteration of
the modeling loop is required; a bound on the maximal number of iterations is
not available in this simple model. Usually modeling techniques tend to focus on
modeling concepts and associated tooling but less on the process of modeling,
i.e. the way of working. To minimize and to control the number of iterations, 
this way of working requires methods that support this highly interactive process 
between domain experts and system analysts. Furthermore, guarantees for the 
quality of this process and the required competencies of the participants involved 
during RE are insufficiently addressed by most common methods. 

This paper focuses on the IM process, its quality and the required competencies
of the participants in this process. Section 2 describes the IM process in
more detail. The competencies of the participants in the IM process are the
subject of Section 3. In Section 4 the correctness of this way of working is
verified and sketched.

2 The Information Modeling Process

In order to make a more fundamental statement about the quality of IM, we 
describe in figure 1 the modeling process in more depth by further elaborating the 
activities of elicitation, modeling and validation. We will also make more explicit
what elements can be distinguished in a formal specification. In the next section 
we will make explicit what competencies are required from its participants, and 
use these competencies to verify this process. 

The processes represented by the arrows labelled 4 up to 8 are suitable for
automation support. A formal theory for these processes and the corresponding
models is elaborated in [13] and may be used as the underlying framework for
building a supporting tool.



2.1 Elicitation 

The elicitation stage is refined in figure 1 into the following substages (the numbers
refer to the numbers in the figure):

1. Collect significant information objects from the application domain. 

2. Verbalize these information objects in a common language. 

3. Reformulate the initial specification into a unifying format. 

The communication in the UoD may employ all kinds of information objects, 
for example text, graphics, etc. However, a textual description serves as a unify- 
ing format for all different media. Therefore the so-called principle of universal 
linguistic expressibility ([14]) is a presupposition for this modeling process: 

All relevant information can and must be made explicit in a verbal way. 

The way of working in this phase may benefit from linguistic tools, see [11], for 
example to detect similarities and ambiguities between sentences. 

Independent of the modeling technique used, the (sentences in the) initial
specification are to be reformulated in accordance with some unifying formatting 
strategy, leading to the informal specification. At this point the modeling process 
may start, during which the involvement of the domain expert is not required. 

Only a few conceptual modeling techniques provide the domain expert and
system analyst with clues and rules which can be applied to the initial specification,
such that the resulting informal specification can actually be used in
the modeling process. Examples of such modeling techniques are NIAM ([15]), 
Object-Role Modeling ([16]) and PSM ([17]). 



2.2 Modeling 

The intention of the modeling phase is to transform an informal specification 
into a formal specification. This phase can be decomposed into the following 
substages (see figure 1): 
4. Discover significant modeling concepts (syntactical categories) and their
relationships.

5. Match sentence structure on modeling concepts. 

During the grammatical analysis, syntactical variation is reduced to some syn- 
tactical normal form. The next step is to abstract from the sentence structures 
within the normal form specification and to match these sentence structures onto 
concepts of the modeling technique used. First of all, this step results in
a structural framework (the conceptual model) and the rules for handling this 
framework (the constraints). Furthermore, the formal specification describes how 
these concepts can be addressed. Without loss of generality, we can state that 
this part of the specification contains a lexicon and rules to describe compound 
concepts (the grammar rules). As a side product, a sample population is obtained 
and put at the disposal of the validation activity. 
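
Steps 4 and 5 can be illustrated with a minimal sketch. It is not the formal theory of [13]: the normal form used here (a type name, a quoted instance, a verb phrase, a type name, a quoted instance), the function name analyse and the second sample sentence are assumptions made only for illustration. The sketch abstracts normal-form sentences into sentence structures and collects the sample population as a side product.

```python
import re
from collections import defaultdict

# Hypothetical syntactical normal form (an assumption of this sketch):
#   <Type> '<instance>' <verb phrase> <Type> '<instance>'
NORMAL_FORM = re.compile(
    r"^(?P<t1>[A-Z][a-z]+) (?P<i1>'[^']+') (?P<verb>[a-z ]+) "
    r"(?P<t2>[A-Z][a-z]+) (?P<i2>'[^']+')$"
)


def analyse(sentences):
    """Group normal-form sentences by sentence structure.

    Returns the sample population per sentence structure and the list of
    sentences that do not fit the normal form (to be fed back to the
    domain expert for reformulation).
    """
    population = defaultdict(list)
    rejected = []
    for sentence in sentences:
        match = NORMAL_FORM.match(sentence)
        if match is None:
            rejected.append(sentence)
            continue
        # Abstract from the instances: only the types and the verb remain.
        structure = (match["t1"], match["verb"], match["t2"])
        instances = (match["i1"].strip("'"), match["i2"].strip("'"))
        population[structure].append(instances)
    return dict(population), rejected


population, rejected = analyse([
    "Band 'The Rolling Stones' records Song 'Paint It Black'",
    "Band 'The Beatles' records Song 'Yesterday'",
    "The Stones made a record",
])
```

In this sketch the two well-formed sentences collapse into one sentence structure, while the third sentence is returned for reformulation rather than being guessed at by the analyst.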

The elementary and structure sentences of the normal form specification pro- 
vide a simple and effective handle for obtaining the underlying conceptual model 
of so-called Snapshot Information Systems (see e.g. [18]), i.e. information sys- 
tems where only the current state of the UoD is relevant. However, even though 
these informal specifications are an important aid in modeling information sys- 
tems, they are still too poorly structured. One of the missing elements is the 
order and history of events. The mutual order of the sentences in an informal 
specification is lost; the analyst has to reconstruct this order. Other missing
structural UoD properties are for instance related to the associates involved in 
events, and the role in which they are involved (see for example [19]). 

2.3 Validation and Verification 

The validation phase is further refined in figure 1 into the following stages: 

6. Produce by paraphrases a textual description of the conceptual model using
the information grammar and the lexicon.

7. Validate the textual description by comparing this description with the
informal specification.

8. Check formal specification for internal consistency, e.g., check model for
flexibility.

Once the information grammar is obtained as part of the formal specification, 
this grammar may be used to communicate the derived models to the domain 
experts for validation purposes. Two ways of validating the analysis models are 
considered: 

1. producing a textual description of (part of) an analysis model using the 
information grammar. 

2. obtaining a parser from the information grammar which provides the domain 
expert with the ability to check whether sentences of the expert language 
are captured by the (current version of the) information grammar. Note that 
this will be very useful when prototyping is used. 

For an example of how to construct such a validation mechanism, see [20].
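
The two ways of validating can be sketched as follows. This is a toy illustration under our own assumptions, not the mechanism of [20]: the information grammar is reduced to one hypothetical template per sentence structure, and the names GRAMMAR, paraphrase and is_captured are introduced here only for the example.

```python
import re

# Hypothetical information grammar: one verbalization template per
# sentence structure (an assumption of this sketch).
GRAMMAR = {
    "records": "Band {0} records Song {1}",
}


def paraphrase(structure, population):
    """First way: produce a textual description of (part of) a model."""
    template = GRAMMAR[structure]
    return [template.format(*instances) for instances in population]


def is_captured(sentence):
    """Second way: check whether a sentence of the expert language is
    captured by the (current version of the) information grammar."""
    for template in GRAMMAR.values():
        # Split the template on its placeholders and allow any
        # non-empty text where an instance is expected.
        literal_parts = re.split(r"\{\d+\}", template)
        pattern = "(.+)".join(re.escape(part) for part in literal_parts)
        if re.fullmatch(pattern, sentence):
            return True
    return False


description = paraphrase("records", [("The Rolling Stones", "Paint It Black")])
```

A sentence rejected by is_captured signals either a sentence that still needs reformulation or a sentence structure that is missing from the provisional grammar; deciding between the two remains a matter for the domain expert and the system analyst.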
3 Information Modeling Competencies

The previous section describes IM from a process point of view. In this section, 
we discuss the actors and investigate the competencies that are required to 
provoke the intended behavior. Note that the processes labelled 4, 5, 6 and 8
have a single associated actor, i.e. the system analyst, while the other processes
(labelled 1, 2, 3 and 7) have more actors. We will not consider any algorithmic
synchronization aspects of cooperation between actors. Whether the domain expert
or the system analyst is actually a team is also not considered.

There are some interesting issues that are not discussed in this paper. For 
instance, social skills, such as negotiating and communicating, are out of scope. 
How the competencies are measured, enforced in practice, and how to check if 
they are applied, is also not discussed. 

3.1 Domain Experts 

A first base skill for a domain expert is the completeness base skill: 

D-1. Domain experts can provide a complete set of information objects.

As obvious as this skill might seem, its impact is rather high: the skill is the 
foundation for correctness of the information system and provides insight into 
the requirements for those who communicate the UoD to the system analyst
during step 1. Furthermore, the skill is required to fulfill the completion princi- 
ple as introduced in section 1. A second foundation for correctness, which also 
is introduced in section 1, is the falsification principle, which states that the 
derived formal specification is deemed to be correct if it does not conflict with 
the informal specification. To support this principle system analysts should have 
suitable skills, which are the topic of the next subsection. 

The next step, labelled with verbalization, is the composition of a proper description
of the information objects. It requires the domain expert to be familiar
with the background of these objects, and to be capable of describing them in
every aspect. This is the purpose of the provision base skill:

D-2. Domain experts can provide any number of significant sample sentences in 
relation to relevant information objects. 

This competence is a bottleneck for verbalization. As this base skill does not 
state that a domain expert can provide a complete set of significant examples 
by a single request from the system analyst, some more base skills that describe 
aspects of the communication between domain experts and system analysts are 
necessary. 

A prerequisite for conceptual modeling is that sentences are elementary, i.e. 
not splittable without loss of information. As a system analyst is not assumed 
to be familiar with the semantics of sample sentences, it is up to the domain 
expert to judge whether sentences can be split. This is expressed by the splitting base
skill:

D-3. Domain experts can split sample sentences into elementary sentences. 

A major advantage and also drawback of natural language is that there are 
usually several ways to express one particular event. For example, passive sen- 
tences can be reformulated into active sentences. By reformulating all sample 
sentences in a uniform way a system analyst can detect important syntactical 
categories during grammatical analysis of these reformulated sample sentences. 
The domain expert is responsible for reformulating the sample sentences, which 
is expressed by the normalization base skill: 

D-4. Domain experts can reformulate sample sentences into a unifying format.

In order to capture the dynamics of the UoD the sentences need to be ordered. 
As a result, the domain expert has to be able to order the sample sentences. 
This is captured in the ordering base skill: 

D-5. Domain experts can order the sample sentences according to the dynamics 
of the application domain. 

The skills D-3, D-4 and D-5 are essential for reformulation (step 3 of the IM
process).

During the modeling process, the information grammar (and thus the con- 
ceptual model) is constructed in a number of steps. After each step a provisional 
information grammar is obtained, which can be used in the subsequent steps to
communicate with domain experts. First a description of the model so far can be 
presented to the domain experts for validation. In the second place, the system 
analyst may confront the domain expert with a sample state of the UoD for val- 
idation. The goal of the system analyst might be to detect a specific constraint 
or to explore a subtype hierarchy. This is based on the validation base skill: 

D-6. Domain experts can validate a description of their application domain. 

This skill is essential for the validation in step 7 of the IM process. During this 
step of the IM process, the system analyst may check completeness by suggesting 
new sample sentences to the domain expert. This is based on the significance 
base skill: 

D-7. Domain experts can judge the significance of a sample sentence. 

The performance of the analysis process and the quality of its result benefit
from the capability of the domain expert to become familiar with the concepts 
of the modeling technique. This is expressed by the conceptuality base skill: 

D-8. Domain experts have certain abilities to think on an abstract level. 

In contrast with the former skills, this latter skill is not required but highly desir- 
able. This skill has a positive effect on all steps of the requirements engineering 
process. 

3.2 System Analysts 

Besides studying the cognitive identity of domain experts it is necessary to in- 
vestigate the cognitive identity of system analysts. 

Domain experts tend to leave out (unconsciously) those aspects in the application
domain which are experienced as generally known and daily routine. An
inexperienced system analyst may overlook such aspects, leading to discussions
on the explicit aspects, which usually address exceptions rather than rules. In
order to detect implicit knowledge, the system analyst should not take things
for granted, but should rather start with minimal bias, from a clean slate. This
is expressed by the tabula rasa base skill:

A-1. System analysts can handle implicit knowledge.

This base skill is most related to the effectiveness of the modeling process and the
quality of its result, and is imperative during steps 1 and 2. 

A next skill for system analysts is the so-called consistency base skill: 

A-2. System analysts can validate a set of sample sentences for consistency. 

The system analyst will benefit from this base skill especially during step 3. 
A major step in a natural language based modeling process is the detection of 
certain important syntactical categories. Therefore a system analyst must be 
able to perform a (possibly partially automated) grammatical analysis of the 
sample sentences. The result of this grammatical analysis is a number of related 
syntactical categories (step 4). This leads to the grammar base skill: 
A-3. System analysts can perform a grammatical analysis on a set of sample
sentences. 

A system analyst is expected (during step 4) to make abstractions from detailed 
information provided by domain experts. By having more instances of some sort 
of sentence, the system analyst will get an impression of the underlying sentence 
structure and its appearances. The sentence structures all together form the 
structure of the information grammar. This is addressed in the abstraction base 
skill rule, which states that system analysts should be able to abstract from the 
result of the grammatical analysis: 

A-4. System analysts can abstract sentence structures from a set of related syn- 
tactical categories. 

Then (during step 5) a system analyst must be able to match sentence structures 
found with the abstraction base skill with the concepts of a particular conceptual 
modeling technique. This is expressed by the modeling base skill rule: 

A-5. System analysts can match abstract sentence structures with concepts of a 
modeling technique. 

As abstraction of sentences is based on the recognition of concrete manifesta- 
tions, the system analyst must be able to generate new sample sentences which 
are validated by the domain experts. This enables the system analyst to for- 
mulate hypotheses via sample sentences. This way the system analyst has a 
mechanism to check boundaries, leading to data model constraints. Being able 
to generate sample sentences is expressed by the generation base skill: 

A-6. System analysts can generate new sample sentences. 

For example the system analyst might resolve a case of ambiguity by offering
(in step 7) well suited sample sentences to the domain expert for validation.
This base skill is related to the ability of the system analyst to get acquainted
with the nature of the application domain, i.e., the counterpart of base skill D-8 for
the domain expert. Note that base skill A-6 is required to put into effect the
falsification principle as introduced in section 1.

Finally, the system analyst is expected to control the quality of the analysis 
models against requirements of well-formedness, i.e. the system analyst must 
satisfy the fundamental base skill: 

A-7. System analysts can think on an abstract level. 

The skills presented for the system analysts focus on a natural language based 
modeling process. Of course, system analysts and system designers also need 
expertise for using and understanding modeling techniques. In [21] a conceptual 
framework is proposed for examining these types of expertise. The components 
of this framework are applied to each phase of the development process and used 
to provide guidelines for the level of expertise developers might strive to obtain. 
4 Controlling Natural Language

As stated in [22], natural language is the vehicle of our thoughts and their communication.
Since good communication between system analyst and domain expert
is essential for obtaining the intended information system, the communication
between these two partners should be in a common language. Consequently
natural language can be seen as a basis for communication between these two
partners. Preferably natural language is used in both the modeling process and
the validation process. In practice, system analyst and domain expert
will have gained some knowledge of each other’s expertise, making the role of
natural language less emphasized, moving informal specification towards formal
specification.

For the modeling process, natural language has the potential to be a precise 
specification language provided it is used well. Whereas a formal specification 
can never capture the pragmatics of a system, an initial specification in natural 
language provides clear hints on the way the users want to communicate in the
future with the information system. 

Since modeling can be seen as mapping natural language concepts onto modeling
technique concepts, paraphrasing can be seen as the inverse mapping intended
as a feedback mechanism. This feedback mechanism increases the possibilities
for domain experts to validate the formal specification, see e.g. [20] and
[23]. Besides the purpose of validation, paraphrasing is also useful to (1) lower
the conceptual barrier of the domain expert, (2) ease the understanding of
the conceptual modeling formalism for the domain expert and (3) ease the
understanding of the UoD for the system analyst.

Up to now we focussed on the positive aspects of the usage of natural language,
but in practice there are not many people who can use natural language
in a way that is (1) complete, (2) non-verbose, (3) unambiguous, (4) consistent, and
(5) expressed on a uniform level of abstraction. In the remainder of this section we will
make it plausible how the base skills for domain experts and system analysts
can reduce the impact of these disadvantages, as these are the main critical
success factors of IM, of which the efficiency and effectiveness are a direct
consequence.

The completeness and provision base skills (D-1 and D-2) anticipate the
completeness problem of natural language specifications. Skill D-2 states that 
domain experts can provide any number of significant sample sentences based on 
these information objects. Assuming that each UoD can be described by a finite 
number of structurally different sample sentences, the probability of missing 
some sentence structure decreases with each new sample sentence generated 
by the domain expert. Furthermore, the system analyst may generate sample 
sentences for validation in order to test correctness and completeness aspects of 
the formal specification so far. These interactions are triggered, controlled and
guided by the system analyst, as stated in the tabula rasa base skill A-1 and the
generation base skill A-6, to aim at convergence.
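
The convergence argument can be made concrete with a small calculation. The following sketch rests on an assumption that is ours and not part of the argument above: each new sample sentence exhibits sentence structure i independently with a fixed probability p_i. Under that assumption, the probability that structure i has not yet appeared after n sentences is (1 - p_i)^n, and a union bound over all structures limits the probability that any structure is still missing. The function name and the three frequencies are hypothetical.

```python
def missing_structure_bound(structure_probs, n):
    """Union bound on the probability that at least one sentence
    structure has not appeared among n independent sample sentences."""
    bound = sum((1.0 - p) ** n for p in structure_probs)
    return min(1.0, bound)


# Hypothetical UoD with three structurally different sentence types
# and assumed frequencies (illustrative values only).
probs = [0.5, 0.3, 0.2]
bounds = [missing_structure_bound(probs, n) for n in (5, 10, 20, 40)]
```

The bound shrinks geometrically with n and is dominated by the rarest structure, which suggests why sample sentences generated by the system analyst to probe rarely mentioned cases are a useful complement to those volunteered by the domain expert.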

Specifications in natural language tend to be verbose, hiding essentials in 
linguistic variety. Complex (verbose) sentences will be fed back to the domain
expert for splitting (splitting base skill D-3) and judging significance for the
problem domain (significance base skill D-7). A natural language specification
may also get verbose by exemplification, providing examples (instantiations) of
the same sentence structure. The grammar base skill A-3 reflects the ability of
the system analyst to recognize the underlying similarity whereas base skill A-4
provides the ability to abstract from superfluous sample sentences. Furthermore,
the ability of domain experts to reformulate sentences in a unifying format (base
skill D-4) and to order sample sentences (base skill D-5) is also helpful to
eliminate the woolliness of initial specifications.

An often raised problem of natural language usage is ambiguity, i.e. sentences 
with the same sentence structure yet having a different meaning. The system 
analyst should have a nose for detecting ambiguities. The consistency base skill 
A-2 provides the system analyst with the required quality. A typical clue comes
from establishing peculiarities in instantiated sentences. In order to decide about 
a suspected ambiguity, the system analyst will offer the domain expert these 
sentences for validation (generation base skill A-6 and validation base skill D-6). 

On the other hand, the system analyst may also wish to elicit further expla- 
nation from the domain expert by requesting alternative formulations or more 
sample sentences with respect to the suspected ambiguity (provision base skill 
D-2). 

The consistency base skill A-2 guarantees that the system analyst is equipped 
with the ability to verify a natural language specification for consistency. Just like 
the entire conceptual modeling process, consistency checking of natural language 
specifications has an iterative character. Furthermore, consistency checking re- 
quires interaction with the domain expert, as a system analyst may have either 
a request for more sample sentences (provision base skill D-2), or a request to 
validate new sample sentences (generation base skill A-6, validation base skill 
D-6 and significance base skill D-7). 

Sentences of a natural language specification are often on a mixed level of
abstraction. As a system analyst has limited detail knowledge, and thus also
limited knowledge at the instance level, a prerequisite for abstraction is typing
of instances (grammar base skill A-3 and abstraction base skill A-4) and mapping
these types onto the concepts of a modeling technique (modeling base skill A-5).
The analysis of instances within a sentence is in fact a form of typing, attributing
types to each of its components. As an example of such a sentence consider: The
Rolling Stones record the song Paint It Black. Some instances will be typed by
the domain expert (the song Paint It Black) while others are untyped (The
Rolling Stones). This may be resolved by applying a type inference mechanism
to untyped instances (see [19]). Typed sentences can be presented to the domain
expert for validation (base skill D-6).
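
As an illustration only, a very simple type inference mechanism might look as follows. It is not the mechanism of [19], which is not reproduced here. The sketch assumes that sentences are already split into subject, verb and object, that an untyped instance is marked with None, and that an untyped instance receives the type most often seen in the same role of the same verb; the type name "band", the second and third sentences and the function name infer_types are hypothetical.

```python
from collections import Counter, defaultdict


def infer_types(sentences):
    """Fill in missing types from typed instances in the same role.

    Each sentence is a tuple
        (subject_type, subject, verb, object_type, object)
    where a type of None marks an untyped instance.
    """
    seen = defaultdict(Counter)  # (verb, role) -> counts of observed types
    for subject_type, _, verb, object_type, _ in sentences:
        if subject_type is not None:
            seen[(verb, "subject")][subject_type] += 1
        if object_type is not None:
            seen[(verb, "object")][object_type] += 1

    def guess(verb, role, current):
        # Keep types given by the domain expert; leave the instance
        # untyped when there is no evidence for this verb and role.
        if current is not None or not seen[(verb, role)]:
            return current
        return seen[(verb, role)].most_common(1)[0][0]

    return [
        (guess(verb, "subject", st), subj, verb, guess(verb, "object", ot), obj)
        for st, subj, verb, ot, obj in sentences
    ]


typed = infer_types([
    (None, "The Rolling Stones", "record", "song", "Paint It Black"),
    ("band", "The Beatles", "record", "song", "Yesterday"),
    (None, "Mick", "sings", None, "Angie"),
])
```

An inferred type is a hypothesis of the system analyst, so in line with the text above the resulting typed sentences would still be offered to the domain expert for validation.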

Although a formal mathematical proof cannot be provided, the above line of
reasoning makes it plausible that the disadvantages of natural language usage
can be overcome if the participants in the requirements engineering process
do have the required skills.
Abstract. This paper presents the results of three experiments designed to assess
the extent to which a Natural-Language Processing (NLP) tool improves
the quality of conceptual models, specifically object-oriented ones. Our main
experimental hypothesis is that the quality of a domain class model is higher if
its development is supported by an NLP system. The tool used for the experiment
- named NL-OOPS - extracts classes and associations from a knowledge
base realized by a deep semantic analysis of a sample text. In our experiments,
we had groups working with and without the tool, and then compared and
evaluated the final class models they produced. The results of the experiments
give insights on the state of the art of NL-based Computer Aided Software Engineering
(CASE) tools and allow us to identify important guidelines to improve
their performance, highlighting which of the linguistic tasks are more critical to
effectively support conceptual modelling.



1 Introduction 

According to the results of market research whose aim was to analyse the potential
demand for a CASE tool integrating linguistic instruments as support for requirements
analysis, 79% of requirements documents are couched in unrestricted NL. Also the
majority of developers (64%) pointed out that the most useful thing to improve general
efficiency in modelling user requirements would be a higher level of automation
[20]. However, there is still no commercial NL-based CASE tool. There have been
many attempts to develop tools that support requirements engineering since the '80s.
The objective of this work was to evaluate how NL-based CASE tools can support
the modelling process, thereby speeding up requirements formalisation. This research
makes use of linguistic techniques which were considered state-of-the-art at that time,
although newer technologies have now been developed. In this work we present the
results of a set of experiments designed to investigate the extent to which a
state-of-the-art NLP tool that supports the semi-automatic construction of conceptual
models improves their quality. The tool used for the experiments - named NL-OOPS -
extracts classes and associations from a knowledge base realized by a deep semantic
analysis of a sample text [13]. In particular, NL-OOPS produces class models at different
levels of detail by exploiting class hierarchies in the knowledge base of the
NLP system and marks ambiguities in the text [16], [17], [18]. In our experiments, we
had groups and individuals working with and without the tool, and then compared and
evaluated the final class models they produced. The results of the experiments give
some insight on the state of the art of NL-based CASE tools and identify some important
parameters for improving their performance.

Section 2 of the paper presents related research projects that use linguistic tools of 
different complexity to support conceptual modelling. Section 3 describes the main 
features of NL-OOPS and basic knowledge upon which the NLP system is built. Sec- 
tion 4 outlines the stages of the experiments and contains an evaluation of the models 
produced, focusing on the effect that NL-OOPS had on their quality. The concluding 
section summarises the findings of the experiment and describes directions for future 
research. 



2 Theoretical Background 

The use of linguistic tools to support conceptual modelling has been proposed in a large
number of studies. We will report here only some of the most important related works
to illustrate the efforts toward the development of NL-based CASE tools. In the early
1980's, Abbott [1] proposed an approach to program design based on linguistic analysis
of informal strategies written in English. This approach was further developed by
Booch [4], who proposed a syntactic analysis of the problem description. Saeki, Horai,
and Enomoto [27] were the first to use linguistic tools for requirements analysis.
Dunn and Orlowska [10] described an NL interpreter for the construction of NIAM
conceptual schemas. Another relevant work was the expert system ALECSI [6] that
used a semantic network to represent domain knowledge [25]. Along similar lines,
Cockburn [7] investigated the application of linguistic metaphors to object-oriented
design. One of the first attempts to apply automated tools to requirements analysis is
proposed in [9]. Goldin and Berry [14] introduce an approach for finding abstractions
in NL text using signal processing methods. The COLOR-X project attempts to
minimize the participation of the user in the job of extracting classes and relationships
from the text [5]. A method whose objective was to eliminate ambiguity in NL requirements
by using a Controlled Language is presented in [22]. The approach described
in [3] for producing interactively conceptual models of NL requirements was
to use a domain dictionary and a set of fuzzy-logic rules. Among the recent projects,
an interactive method and a prototype - LIDA - are described in [24]. However, most
of the text analysis remains a manual process. Another research area connected both
with conceptual modelling and NLP is the investigation of the quality of requirements’
language [17], [23]. In particular, some authors focused on the support of
writing requirements [2].

In this context, NL-OOPS 1 - the tool used in the experiments and described in the
next section - presents a higher degree of automation. It is based on a large NLP system
[13] that makes it possible to produce completely automatically a draft of a conceptual
model starting from narrative text in unrestricted NL (English) [16].



1 http://nl-oops.cs.unitn.it 
3 The NL-OOPS Tool

There are two complementary approaches to developing a tool for extracting from textual
descriptions the elements necessary to design and build conceptual models. The first
limits the use of NL to a subset that can be analysed syntactically. Various dialects of 
"Structured English" do that. The drawback of this method is that it will not work for 
real text. The second approach adopts NLP systems capable of understanding the 
content of documents by means of a semantic, or deep, analysis. The obvious advan- 
tage of such systems is their application to arbitrary NL text. Moreover, such systems 
can cope with ambiguities in syntax, semantics, pragmatics, or discourse. Clearly, 
compared to the first category, they are much more complex, require further research, 
and have a limited scope. 

NL-OOPS is an NL-based CASE prototype. It is founded on the LOLITA (Large-scale
Object-based Language Interactor, Translator and Analyser) NLP system, which
includes all the functions for analysis of NL: morphology, parsing (1500-rule grammar),
semantic and pragmatic analysis, inference, and generation [13]. The knowledge
base of the system consists of a semantic network, which contains about 150,000
nodes. Thus LOLITA is among the largest implemented NLP systems. Documents in
English are analysed by LOLITA and their content is stored in its knowledge base,
adding new nodes to its semantic network. The NL-OOPS prototype implements an
algorithm for the extraction of classes and associations from the semantic network of
LOLITA. The NL-OOPS interface consists of three frames [15]. The first one contains the
text being analysed; the second frame gives a partial representation of the SemNet
structures used by LOLITA for the analysis of the document. After running the modelling
module, the third frame gives a version of the class model. The tool can export
intermediate results to a Word file or a Java source file; a traceability function allows
the user to check what nodes were created for a given sentence. The nodes browser of
NL-OOPS makes available further information related to a specific node, e.g. the
hierarchies in which it is involved.



4 The Experiments 



Our main experimental hypothesis was that when model development is supported by
an NLP-based tool, the quality of the domain class model is higher and the design
productivity increases. Consequently the goal of the experiments was to confirm or refute
this assumption and then to identify the features and the linguistic tasks that an effective
NL-based CASE system should include.



4.1 Realization of the Experiments 

In each experiment, we assigned a software requirements document to the participants
and we asked them to develop in a given time a class domain model, identifying
classes, associations, multiplicity, attributes and methods. Half of the participants
were supported by the NL-OOPS tool. They were also given some training in the use
of NL-OOPS functionalities: changing the threshold for the algorithm to produce
models at different levels of detail; viewing the list of candidate classes; and navigating
the nodes browser of the knowledge base. The chosen class model could then be
deployed in a Java file, which was reverse engineered into PoseidonCE 2 or Rational
Rose 3 class diagrams. Both tools create the list of classes and the analyst has
only to drag them into the diagram, and then to check and complete the diagram. Before
the experiment, we administered a short questionnaire to assess the experience of the
participants. To compare the results of the experiments, we used the same requirements
text in all the experiments. In particular, the text named Softcom [26] deals
with a problem that requires some familiarity with judged sports. The language is
quite simple, but also realistic. It contains all the typical features of requirements
text: use of the passive voice, etc. The second case study, named Library [11], had a
level of difficulty similar to that of the Softcom case. Both texts are cited in [15]. The
first two experiments involved pairs of analysts. The first experiment focused on
the quality of the models obtained with and without NL-OOPS; in the second, participants
were asked to save the diagrams at fixed time intervals, to obtain data also about
productivity. In the third experiment, each analyst worked alone developing two models
for two different problem statements, one with the tool and one without it. For the
first two experiments participants were undergraduate students and their competence
in object-oriented conceptual modelling was comparable to that of junior analysts. For
the last experiment participants were PhD students with higher competence.

The classes suggested by NL-OOPS with different thresholds are given for Softcom
and Library in Tables 1 and 2, respectively (the threshold influences the depth
of LOLITA’s semantic network hierarchies). These classes constitute the main input
for the analysts working with NL-OOPS. To interpret the results we refer to the class
models proposed with the problem sources [11], [26] (last column in Tables 1 and 2).
The names of the classes in the reference models are in bold. This choice was made to
minimize the subjectivity in the evaluation of the models produced by the
participants. We calculated recall (R, the number of correctly identified classes divided
by the total number of correct classes), precision (P, the number of correctly
identified classes divided by the total number of classes identified), and the F-measure
(which combines R and P) to evaluate the performance for the class identification task [29].
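
The three measures can be sketched as follows. This is only an illustration, not the evaluation script used in the experiments: the function name is hypothetical, the F-measure is assumed to be the balanced harmonic mean 2PR/(P+R), and the candidate list is our reading of the NL-OOPS-3 column for the Softcom case.

```python
def class_identification_metrics(candidates, reference):
    """Return (recall, precision, f_measure) for a list of candidate class names.

    A candidate counts as correct when its name appears in the reference model
    (case-insensitive exact match; this matching rule is an assumption).
    """
    candidate_set = {name.lower() for name in candidates}
    reference_set = {name.lower() for name in reference}
    correct = candidate_set & reference_set

    recall = len(correct) / len(reference_set) if reference_set else 0.0
    precision = len(correct) / len(candidate_set) if candidate_set else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    # Balanced F-measure: harmonic mean of precision and recall (assumed form).
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


# Reference classes for the Softcom case (11 classes).
reference = ["Competition", "Competitor", "Figure", "Judge", "League", "Meeting",
             "Score", "Season", "Station", "Team", "Trial"]
# Candidate classes at the third threshold (5 classes).
candidates = ["Competitor", "Group", "Judge", "Number", "Score"]

r, p, f = class_identification_metrics(candidates, reference)
print(f"R={r:.1%} P={p:.1%} F={f:.1%}")  # R=27.3% P=60.0% F=37.5%
```

Three of the five candidates are in the reference model, so R = 3/11 and P = 3/5, which agrees with the values reported for that threshold.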

Some classes are not contained in the models proposed by NL-OOPS but are instead present
in the list of candidate classes. In the first two cases two classes are indicated,
entity (worker) and entity (announcer), which correspond to ambiguities in the text. In the
first case, entity was introduced by the NLP system for the sentence “Working from stations,
the judges can score many competitions”: it cannot be automatically assumed that the
subject of working is “judges”. The second class results from an analysis of the
sentence “In a particular competition, competitors receive a number which is announced
and used to split them into groups”, where the subject of “announced” is unknown. The
node browser of NL-OOPS allows the analyst to go back to the original phrase to
determine whether it gives the information necessary for the model.



2 Gentleware: http://www.gentleware.com 

3 IBM - Rational Software: www.rational.com 
Table 1. Classes identified by NL-OOPS: SoftCom case

| NL-OOPS-1 (12) | NL-OOPS-2 (10) | NL-OOPS-3 (5) | Reference classes (11) |
|---|---|---|---|
| Competition | | | Competition |
| Competition (subclass of Competition in SemNet) | | | |
| Competitor | Competitor | Competitor | Competitor |
| Entity (worker) | Entity (worker) | | |
| Entity (announcer) | Entity (announcer) | | |
| | | | Figure (styles, routines) |
| Group | Group | Group | |
| High | High | | |
| Judge | Judge | Judge | Judge |
| | | | League |
| Meeting | Meeting | | Meeting |
| Number | Number | Number | |
| Score | Score | Score | Score |
| | | | Season |
| | | | Station |
| | | | Team |
| | | | Trial |
| Softcom | Softcom | | |
| R=45.5%; P=41.7%; F-measure=43.5% | R=36.4%; P=40.0%; F-measure=38.1% | R=27.3%; P=60.0%; F-measure=37.5% | R avg=36.4%; P avg=47.2%; F-measure avg=39.7% |

* Word in parenthesis corresponds to the actual meaning of concept shown implicitly in network



For the Library case, the measures of the class identification task are higher than 
for the SoftCom case. However, the quality of the models produced by NL-OOPS is 
reduced by the presence of classes due to unresolved anaphoric references (“It”, “En- 
tity”, “Pair”), or to ambiguity in the sentences. For example, the subject in sentence 
“The reservation is cancelled when the borrower check out the book or magazine or 
through an explicit cancelling procedure” is omitted. Another spurious class is “Mur- 
der”, which was introduced by LOLITA as subject of an event related to the “re- 
move”-action (due to the absence of domain knowledge). 



4.2 Analysis of the Results 

Evaluating the quality of the models is a subjective process. The experience gained
from the experiments and the analysis of the literature about the quality of conceptual
models 4 helped us to define a schema to support their evaluation. The schema takes into



4 There are only a few papers about this topic; see, for example, [21], [28].
Table 2. Classes identified by NL-OOPS: Library case



NLOOPS-1 (17) 


NL-OOPS-2 (10) 


NL-OOPS-3 (7) 


Reference classes 
(7) 


Book 


Book 


Book 


Book 


Borrower 


Borrower 


Borrower 


Borrower 


Person 








Copy 






Item (Book Copy; 
Magazine Copy) * 


Employee 


Employee 






Entity (delete, update, create)


Entity (delete, update, 
create) 


Entity (delete, 
update, create) 




Entity (cancel) 








Entity (register) 








Software_System 


Software_System 


Software_System




It (Library) 


It (Library) 


It (Library) 




Entity (Library) 








Library 


Library 


Library 




Loan 






Loan 


Magazine 


Magazine 


Magazine 


Magazine 


Murderer (remove) 








Purchase 








Reservation 


Reservation 




Reservation 


Thing (superclass of 
Book and Magazine) 


Thing (superclass of 
Book and Magazine) 






Pair (superclass of 
Book and Magazine) 








Title 






Title (Book Title; 
Magazine Title) * 


R=100%; P=41.2%; F-measure=58.4%


R=57.1-71.4%; P=40.0-50.0%**; F-measure=47.0-58.8%


R=42.9%; P=42.9%; F-measure=42.9%


R avg=66.7%-71.4%; P avg=41.4%-44.7%; F-measure avg=49.4%-53.4%



* The hierarchies for Copy and Title represent two alternatives used to evaluate the model developed by the 
students 

** The maximum values were calculated including the class Thing 



account the criteria related to both external and internal quality (considered in [28]),
evaluating:

- the content of the model, i.e. semantic quality: how completely and how deeply the model
represents the problem,

- the form of the model, i.e. syntactic quality (proper and extensive use of UML
notation, drawing on the full variety of UML expressions such as aggregation,
inheritance, etc.),

- the quantity of identified items (in a class model: number of classes, attributes,
operations, associations, and hierarchies).
Each of these criteria reflects a particular aspect of model quality. To evaluate each
one we assign a scale with a range from 0 (lowest mark) to 5 (highest mark). The
overall quality of the models is measured using a mixed approach:

- the application of the quality schema,

- the evaluation by two experts, who assessed the models as they usually do for their
students’ projects.

For the class identification task, which is a part of the conceptual model design task,
the quality was evaluated by calculating recall, precision, and the F-measure.



First Experiment. In the first, preliminary experiment the group of twelve
students was split into six subgroups [19]. Each group had access to a PC with
Microsoft Office 2000 while carrying out the experiment. Three groups worked with
NL-OOPS. The length of the experiment was 90 minutes. For the six diagrams
produced, two groups used PowerPoint, one used Excel, and all groups working
without NL-OOPS chose Word. The results of the class identification task are given in
Table 3.



Table 3. Class identification

| | 1 tool | 2 tool | 3 tool | 1 | 2 | 3 |
|---|---|---|---|---|---|---|
| Recall | 72.7% | 54.5% | 81.8% | 100.0% | 100.0% | 81.8% |
| Precision | 88.9% | 66.7% | 69.2% | 78.6% | 68.8% | 90.0% |
| F-measure | 80.0% | 60.0% | 75.0% | 88.0% | 81.5% | 85.7% |



To evaluate the overall quality of the class diagrams, we asked two experts to mark and
comment on the solutions proposed by the different groups (Table 4). The experts
judged the best model to be the one produced by group 5, in which two of the
students had used UML for real projects. If this group were excluded on that basis, so as
to compare groups of similar experience, the best model would be one developed with
the support of NL-OOPS. Considering these results together with those in Table 1, the tool
seemed to have an inertial effect: on the one hand it led to the tacit acceptance of
classes (e.g., group); on the other hand it resulted in the failure to introduce some
indispensable classes (e.g., season, team). Analysis of the feedback given
by the participants raised two points: (a) those who used NL-OOPS
would have preferred more training; (b) each group that used NL-OOPS would have
preferred to have a tool to design the diagrams, while groups working without the tool did
not voice this preference. These considerations informed the design of
the subsequent experiments.
Table 4. Experts' evaluation of overall model quality

| Groups | Quality |
|---|---|
| 1 tool | pretty good |
| 2 tool | low |
| 3 tool | good |
| 4 | pretty good |
| 5 | good |
| 6 | low |



Second Experiment. We later repeated the experiment, again involving students.
Participants were divided into five groups, two of which used NL-OOPS. The participants
had access to Poseidon CE. We asked them to produce the model of the domain
classes for the problem assigned. The length of the experiment was set to 1 hour. As
we also wanted to obtain information regarding the productivity of conceptual
modelling supported by linguistic tools, we asked the students to save a screenshot of
their model every fifteen minutes. The performance on the class identification task
is summarised in Table 5, where we report average recall, precision and F-measure for
both groups.



Table 5. Class identification 





15’ 


30’ 


45’ 


60’ 


Recall 

tool 


45.5% 


69.7% 


75.7% 


75.7% 


50.0% 


59.1% 


63.6% 


81.8% 


Precision 

tool 


33.4% 


71.3% 


74.2% 


76.4% 


52.6% 


60.3% 


72.9% 


73.3% 


F-measure 

tool 


38.5% 


66.2% 


73.5% 


74.7% 


50.9% 


59.2% 


67.3% 


75.8% 



Marks and comments on the overall quality made by two experts are given in table 6. 



Table 6. Expert evaluation of overall model quality

| Groups | Quality |
|---|---|
| 1 | low |
| 2 | pretty good |
| 3 | good |
| 4 tool | pretty good |
| 5 tool | low |



The experts judged the best model to be the one produced by group 3, whose
participants (according to the questionnaire) had used UML for real projects. The application
of the quality schema described in section 4.2 gives the results in Table 7.
Table 7. Overall Quality

| Criterion | Group | 15’ | 30’ | 45’ | 60’ |
|---|---|---|---|---|---|
| Content | no tool | 0.0 | 1.7 | 2.0 | 3.3 |
| Content | with tool | 3.0 | 3.3 | 2.5 | 3.5 |
| Form | no tool | 0.0 | 1.7 | 2.1 | 3.7 |
| Form | with tool | 3.0 | 3.2 | 2.5 | 2.8 |
| Items | no tool | 1.1 | 2.6 | 2.9 | 3.9 |
| Items | with tool | 2.2 | 2.2 | 3.4 | 4.1 |
| Total | no tool | 0.4 | 2.0 | 2.3 | 3.6 |
| Total | with tool | 2.8 | 2.9 | 2.8 | 3.5 |



Third Experiment. In the third experiment we made some further changes. First of
all, the participants worked individually. They had to deal with two different
problem statements of comparable difficulty, one with and one without the NL-OOPS prototype.
We set the length of the experiment to 20 minutes for each case. As in the second
experiment, we decided to collect progressive results, so we asked them to save the
model in an intermediate file after the first 10 minutes. The results for the class
identification task are presented in Table 8. We should note that even
though the experts chose requirements texts of comparable level, the texts differed
from the point of view of linguistic analysis. For instance, the Library case turned out to be
more difficult for the NLP system to understand because it contains many anaphors
(Tables 1 and 2).



Table 8. Class identification 



Parameter 


Case 


10’


20’


Recall 

tool 


softcom 


51.5% 


49.6% 


70.9% 


62.2% 


library 


47.6% 


53.6% 


softcom 


47.6% 


59.5% 


53.6% 


62.5% 


library 


71.4% 


71.4% 


Precision 

tool 


softcom 


85.2% 


75.9% 


72.5% 


62.0% 


library 


66.7% 


51.5% 


softcom 


66.7% 


58.3% 


51.5% 


50.8% 


library 


50.0% 


50.2% 


F-measure 

tool 


softcom 


64.2% 


59.9% 


71.7% 


62.1% 


library 


55.6% 


52.5% 


softcom 


55.6% 


57.2% 


52.5% 


55.7% 


library 


58.8% 


59.0% 



We can assume here the existence of some inertial effect, because the users tend to keep
all the candidate classes provided by NL-OOPS without getting rid of the fake
classes (“it”, “thing”, “entity”, etc.).

The application of the quality schema described in section 4.2 gives the results in
Table 9. Marks and comments on the overall quality made by two experts are given
in Table 10. In this experiment both quality and productivity improved
thanks to the support of the NL-OOPS tool, even though the participants were
pessimistic about using this kind of linguistic instrument.
Table 9. Overall Quality



Parameter 


Case 


10' 


20'


Content 

tool 


softcom 


1.4 


1.3 


3.9 


3.5 


library 


1.1 


3 


softcom 


2.1 


2.1 


3.5 


3.9 


library 


2 


4.2 


Form 

tool 


softcom 


1.5 


1.4 


3.8 


3.4 


library 


1.3 


3 


softcom 


2.4 


2.5 


4.2 


4.5 


library 


2.6 


4.9 


Items 

tool 


softcom 


0.7 


1 


3 


3.1 


library 


1.2 


3.2 


softcom 


1.7 


2 


4.1 


4.4 


library 


2.2 


4.7 


Total 

tool 


softcom 


1.2 


1.2 


3.6 


3.3 


library 


1.2 


3.1 


softcom 


2.1 


2.2 


3.9 


4.3 


library 


2.3 


4.6 



Table 10. Expert evaluation of overall model quality 



Person 


Evaluation 


Time 


10’ 10’ tool 


20’ 20’ tool 


1 


low pretty good 


low good 


2 

3 


* 

low good 


pretty good good 
low good 


4 


low pretty good 


low good 


5 

6 


low pretty good 


pretty good good 
pretty good good 


7** 


- 


- 


8 


low 


pretty good low 


9 


low good 


low good 


10 


low - 


low low 



*Grey cells correspond to Softcom 

**Person 7 violated the rules of experiment, so the data cannot be considered as correct 



5 Conclusions 

The empirical results from the three experiments neither confirm nor refute the initial
hypothesis of this paper that the quality of a domain class model is higher if its
development is supported by an NLP system. There is some evidence, however, that model
quality is better for NLP tool users early on in the modelling process (see the
results of the third experiment at 10 minutes in Table 9). As to the impact of an NLP
tool on productivity, the results of the experiments are inconclusive, but
there is some evidence in Tables 7 and 9 that users work faster when supported by the
tool. We interpret these results to mean that in the initial steps the tool is helpful in
speeding up the work, but by the end of the process the advantage is lost because the
users have to go into the details of the text anyway to verify the correctness of the list of
classes and to derive other elements of the class diagrams. The prototype was in some
ways misused, as users were not able to take advantage of all the functionality
provided by the system. Apparently, the groups working with the tool used only the
initial draft of the class model and only part of the list of candidate classes produced
by the tool. Users did not go deep into the details of the semantic network
constructed by the system and focused their attention only on the final list of the most
probable class candidates. To avoid this effect, NL-OOPS should better integrate
the different types of information it generates with the diagram visualization
tools. On the methodological level, the quality evaluation schema and the
approach we adopted in the experiments described in this paper for the evaluation of
NL-OOPS can be used to evaluate the output produced by any CASE tool designed to
support the modelling process. Other lessons learned from the experiments regarding
features for an effective NLP-based CASE tool include:

- the knowledge base produced by the linguistic analysis must be presentable in a 
user-understandable form, 

- the most and least probable class and relationship candidates should be highlighted, 
to help the user modify the final model, either by extending it with other classes or 
by deleting irrelevant ones, 

- the tool should be interactive to allow the analyst to resolve ambiguities and reflect 
these changes in the semantic representation immediately. 

In general terms, the experiments confirm that, given the state of the art for NLP sys- 
tems, heavyweight tools are not effective in supporting conceptual model construc- 
tion. Instead, it makes sense to adopt lightweight linguistic tools that can be tailored 
to particular linguistic analysis tasks and scale up. Moreover, linguistic analysis may 
be more useful for large textual documents that need to be analysed quickly (but not 
necessarily very accurately), rather than short documents that need to be analysed 
carefully. We will focus our future research in this direction.
Abstract. This paper discusses an approach to tracking decisions made in
meetings from documentation such as minutes and storing them in such a way 
as to support efficient retrieval. Decisions are intended to inform future actions 
and activities but over time the decisions and their rationale are often forgotten. 
Our studies have found that decisions, their rationale and the relationships be- 
tween decisions are frequently not recorded or often buried deeply in text. Con- 
sequently, subsequent decisions are delayed or misinformed. Recently, there 
has been an increased interest in the preservation of group knowledge invested 
in the development of systems and a corresponding increase in the technologies 
used for capturing the information. This results in huge information reposito- 
ries. However, the existing support for processing the vast amount of informa- 
tion is insufficient. We seek to uncover and track decisions in order to make 
them readily available for future use, thus reducing rework. 



1 Introduction 

Solutions to systems engineering 1 problems are products of collaborative work over a 
period of time. Several people with varied expertise and experience invest their 
knowledge in the product. During the product’s development, several decisions are 
made. Some are about the product, others about the process. These decisions can 
broadly be classified into two: system decisions and process decisions. System deci- 
sions deal with the technical aspects of a system such as its features, architecture, reli- 
ability, safety, usability etc. Process decisions include responsibility delegation, mile- 
stones, follow-on actions, schedules, even budget. We are interested particularly in 
process decisions that require actions to be taken in order to progress a project. This 
work forms part of our research on the Tracker 2 project where we seek to understand 
the nature of decisions in teams and organisations; in particular the way past decisions 
are acted on, referred to, forgotten about and otherwise function as part of long term 
organizational activity. 

Meetings are important activities in collaborative work. They represent activities 
within a process. There is a history of previous meetings, a constellation of concepts 
and documents that are brought into the meeting, which often evolves as a result of 
the meeting, and are taken into work activities afterwards [1]. The meetings drive



1 The development of software systems 

2 http://www.comp.lancs.ac.uk/computing/research/cseg/projects/tracker/ 

F. Meziane and E. Metais (Eds.): NLDB 2004. LNCS 3136, pp. 147-158. 2004. 

© Springer-Verlag Berlin Heidelberg 2004 
processes, which in turn are driven by the outcomes of these processes. The outcome
of a meeting is communicated through minutes. The standard practice in many
organisations is to circulate minutes to the meeting's participants (the list of attendees,
apologies and absentees). However, some decisions require the attention of people
who are not members of the meeting. Consequently, decisions must be extracted from
minutes and communicated to an external audience. Failure to meet this
responsibility can lead to misinformed decisions or a delay in subsequent decisions being
made.

We argue that finding, associating and communicating decisions can be automated. 
Natural Language Processing (NLP) techniques can be used for identifying action 
verbs (actions) and subjects (agents) in sentences. Anecdotal evidence shows that 
minutes are often circulated late, just before a subsequent meeting. By then it is too late
to remember what transpired in the previous meeting or to extract decisions and
communicate them to an external audience. In addition, few people read the minutes in
good time even if the minutes are circulated early. Over time the minutes become too
voluminous and overwhelming to analyse manually. As a result, participants go into
meetings ill prepared, leading to unsatisfactory participation.

In the requirements engineering field, language resources have been used to iden- 
tify systems decisions (candidate features) by picking out verbs from requirements 
documents [9, 10]. In the same way, nouns can be identified. This is important
because nouns are considered candidate objects in Object-Oriented technology.
Furthermore, requirements can be categorised, based on modal verbs, into mandatory, less
obligatory, etc. [10].

A similar approach can be used for identifying actions 3 and finding associations 
between actions. Automatic extraction of actions and relationships between them can 
provide the means for interpreting process decisions. This provides similar support to 
the system engineering process as the identification of verbs and nouns for system 
decisions. Actions also provide a means for tracking the progress of a process and
may be used to estimate individual contributions to collaborative work.



2 Tokenisation and Linguistic Annotation 

Before we can analyse text we need to tokenise it. Tokenisation involves breaking a
document into constituent parts based on the structure and style of the document. The
tokenised text can be represented as plain text or in XML format. An XML-based
representation provides a better way to separate content from its meta data (including
formatting information). Thus, the two concerns, content and rendering, can be
addressed separately. Meta data is used for analysing content, and formatting
information for rendering the content. Many NLP tools require plain text and represent
semantic or part-of-speech information as plain text. For example, using the C7
part-of-speech (pos) and semantic tag sets [6, 9], Item 1 would be represented in plain text as
follows:

Item/ id="1.1"/ pos="NN1"/ sem="O2" 1/ id="1.2"/ pos="MC1"/ sem="N1".



3 A statement which specifies what should be done and who should do it. 
In XML, the previous example would be represented as

<w id="2.1" pos="NN1" sem="O2">Item</w>
<w id="2.2" pos="MC1" sem="N1">1</w>

where w = word, id = the ordinal number of a word in a sentence, pos = part of
speech and sem = semantic category. We will adopt the latter representation. For the
purpose of this paper, we will discuss how to convert a Microsoft Word document,
internally represented in rich text format (.rtf) or Word document format (.doc), to XML.
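
As a minimal sketch of this word-level representation, the following Python fragment builds the `w` elements for one sentence with the standard `xml.etree.ElementTree` module. The function name and the input format (word, pos, sem triples) are assumptions made for illustration; in the tool itself the tags would come from the linguistic taggers rather than being typed in by hand.

```python
import xml.etree.ElementTree as ET


def annotate_sentence(sentence_no, tagged_words):
    """Build an <s> element from (word, pos, sem) triples.

    Each word receives an id of the form "<sentence>.<position>",
    mirroring the id attribute used in the example above.
    """
    s = ET.Element("s")
    for position, (word, pos, sem) in enumerate(tagged_words, start=1):
        w = ET.SubElement(s, "w", id=f"{sentence_no}.{position}", pos=pos, sem=sem)
        w.text = word
    return s


# Hand-written tags for "Item 1", copied from the example in the text.
sentence = annotate_sentence(2, [("Item", "NN1", "O2"), ("1", "MC1", "N1")])
print(ET.tostring(sentence, encoding="unicode"))
```

Keeping the annotations as attributes leaves the element text untouched, so the original wording can always be recovered by concatenating the text of the `w` elements.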



2.1 Document Style and Structure Based Tokenisation and Annotation 

We are developing a decision capture tool (Fig. 4) which accepts a document in rich
text format (.rtf) or Microsoft Word document format (.doc) as input and returns an
XML document. Typically, Word documents consist of sections, paragraphs,
sentences and words and use different formatting styles such as italics, underline and
bold face. Such information carries implicit meaning; for example, a paragraph
break may indicate the introduction of a new issue or a different viewpoint. Similarly,
a difference in font style between two adjacent paragraphs suggests the introduction of
a different item or viewpoint. The tool uses such information to introduce annotations
into a document; for example, paragraphs have a <paragraph> tag around them (Fig. 2). In
addition, indicator words such as agenda (issue), minute (statement) or action are
added to enclose the document units’ annotations. Indicator words and phrases have
been shown to be useful in automatic summarisation [8].

A structural tagger uses the template (Fig. 1) to introduce appropriate annotations.
The element paragraph consists of a series of sentences, represented by element ‘s’,
and each sentence contains a series of words, represented by element ‘w’. Each word
has attributes ‘id’ (ordinal number), ‘sem’ (semantic) and ‘pos’ (part-of-speech).

The structural tagger breaks the text into paragraph units. The linguistic tagger (see 
Linguistic Annotation) adds the smaller units such as sentences and words. 




Fig. 1. Structural layers of a word document
Fig. 2. Identifying and annotating text constituencies



Fig. 2 shows a document after structural annotations are introduced. The first
paragraph is identified as an issue because it is in bold face in the original document.
Similarly, the second paragraph is identified as an issue because its first line is in
bold face. However, the third paragraph is not an issue. Since it comes immediately
after an issue paragraph, it is identified as a minute statement. It is logical to argue
that a paragraph immediately after an agenda item and with a different formatting
style is a minute statement.
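
The labelling rule just described can be sketched in a few lines of Python. This is an illustration only: the `Paragraph` type and its `first_line_bold` flag are hypothetical stand-ins for whatever the .rtf/.doc reader provides, and the fallback label for paragraphs that match neither rule is our assumption, since the text does not say how such paragraphs are treated.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    text: str
    first_line_bold: bool  # assumed to be supplied by the document reader


def label_paragraphs(paragraphs):
    """Label each paragraph as 'issue', 'minute' or plain 'paragraph'.

    Rule 1: a paragraph whose first line is in bold face is an issue.
    Rule 2: a non-bold paragraph directly after an issue is a minute statement.
    Anything else keeps the generic label (an assumption of this sketch).
    """
    labels = []
    previous = None
    for paragraph in paragraphs:
        if paragraph.first_line_bold:
            label = "issue"
        elif previous == "issue":
            label = "minute"
        else:
            label = "paragraph"
        labels.append(label)
        previous = label
    return labels


document = [
    Paragraph("Agenda 1: Review of actions", True),
    Paragraph("Agenda 2: Test plan", True),
    Paragraph("Site XYZ to create test cases for module A1.", False),
]
print(label_paragraphs(document))  # ['issue', 'issue', 'minute']
```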

The converted (structurally annotated) document is then passed through a linguistic 
tagger. The linguistic tagger further breaks the document into sentences and words 
and introduces meta data such as ordinal number (id), part-of-speech (pos), and se- 
mantic (sem) category (see examples under Linguistic Annotation). The output is a 
document that conforms to the template in Fig. 1.



2.2 Linguistic Annotation 

We use two existing NLP tools, namely CLAWS 4 and Semantic Analyser 5 that em- 
ploy a hybrid approach consisting of rule-based and probabilistic NLP techniques. 
CLAWS [7] is an important tool for processing English language text. It uses a statis- 
tical hidden Markov model technique and a rule-based component to identify the 
parts-of-speech of words to an accuracy of 97-98%. The Semantic Analyser [10] uses 
the POS-tagged text to assign semantic annotations that represent the general seman- 
tic field of words from a lexicon of single words and an idiom list of multi-word 
combinations (e.g. ‘as a rule’). These language resources contain approximately 
61,400 words and idioms and classify them according to a hierarchy of semantic 
classes. 



4 http://www.comp.lancs.ac.uk/ucrel/claws/ 

5 http://www.comp.lancs.ac.uk/ucrel/usas/ 
The tools were trained on large corpora of free text that had been analysed and
‘tagged’ with each word’s syntactic or semantic category. Extremely large corpora
have been compiled; for instance, the British National Corpus consists of
approximately 100 million words [2]. For some levels of analysis, notably POS annotation,
probabilistic NLP tools have been able to achieve very high levels of accuracy and 
robustness unconstrained by the richness of language used or the volume of docu- 
mentation. 

Such techniques extract interesting properties of text that a human user can com- 
bine and use to infer meanings. Evidence from other domains suggests that such tools 
can effectively support the analysis of large documentary sources. For example, NLP 
tools were used to confirm the results of a painstaking manual discourse analysis of 
doctor-patient interaction [11] and they also revealed information that had not been 
discovered manually. 



3 Analysing Content 

Words are shells within which meaning lies. When we break the shell, the inner 
meaning emerges. At the surface is the literal (surface) meaning; deeper within is 
figurative and metaphorical meaning. For example, the Kiswahili word safari evokes
different meanings for different people. Literally, safari is a Swahili word for a journey,
a long and arduous one. However, it evokes a different meaning in Western
societies than in the Swahili-speaking communities of East Africa. In the West,
safari is almost synonymous with touring and game viewing. From this perspective, it is a
happy outing.



3.1 Surface Analysis 

Surface analysis involves the identification of indicator words in a domain and con- 
straining their meaning within the domain. Using indicator words, we can determine 
specific elements of information in a document. For example, in a student’s answer 
booklet, the word Answer is used to suggest a solution even though we are aware that 
some are incorrect. If taken literally, all the worked solutions would be correct. In this 
case, the use of the word Answer serves to identify and separate solutions from each 
other. In documents such as minutes of meetings, the same approach can be used for 
the analysis of elements such as agenda items (issues) and minutes (solution state- 
ment). In a minutes document, agenda items are often marked with the tag ‘agenda’ 
followed by an ordinal number. Similarly, minutes are marked with a minute tag fol- 
lowed by a number. 
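
A minimal sketch of this kind of indicator-word matching is given below. The pattern is an assumption about how such lines might look (an indicator word, a number such as 3 or 3.1, optional punctuation, then the item text); real minutes will vary, and the function name is hypothetical.

```python
import re

# Indicator word, then an ordinal such as "3" or "3.1", then the item text.
INDICATOR = re.compile(
    r"^\s*(agenda|minute|action)\s+(\d+(?:\.\d+)*)\b[.:]?\s*(.*)$",
    re.IGNORECASE,
)


def find_indicator(line):
    """Return (indicator, number, text) if the line starts with an indicator word.

    Lines that are merely numbered, without an indicator word, return None.
    """
    match = INDICATOR.match(line)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2), match.group(3)


print(find_indicator("Agenda 3: Test plan for module A1"))
print(find_indicator("Minute 3.1 Site XYZ to create test cases"))
print(find_indicator("1. Introduction"))
```

The third call returns None, which is the intended behaviour: a plain numbered heading is not treated as an agenda item or a minute.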

Our approach incorporates document structure and style to distinguish ordinary
numbered items from minute elements (Fig. 2). The body of a minutes document
consists of agendum-minute pairs. We have observed that these sections are formatted
differently. A new element starts in a new paragraph, and a new paragraph with a
different formatting from the previous one suggests a new element. For example, agenda
items frequently have formatting such as italics, bold face or underline. Using such
styles, in combination with the surface meaning of the indicator words or
phrases, we can retrieve agenda items and minute statements. This approach is
suitable within a domain. If taken outside the domain, the meaning of indicator words
may change considerably.

While our approach coped fairly well with the set of minutes used in the study 
(Table 1), it is not our claim that it will work in every situation. A standard document 
template is desirable. We are developing a template for taking minutes that can be 
used with our tool. 



3.2 Semantic (Deep) Analysis 

Since surface implies that something exists underneath, we require deeper analysis to
discover the meaning of sentences in a certain context. Such meaning is represented in
the part-of-speech and semantic categories of Natural Language Processing. Analysis
at this level can reveal more information, e.g. two sentences can be shown to be
similar even though different words or syntax are used. For example, the following
sentences carry the same meaning but use different words and syntax:

a) Site XYZ should develop test cases for module A1

b) Site XYZ to create test cases for module A1

In these examples, the words develop and create, or their inflected forms, have
semantic category values A2.1 and A2.2 respectively. A2.1 means affect: modify or
change; A2.2 means affect: cause 6 . These are both subdivisions of the affect category.
Similarly, words with the same semantic categories (semantic chains) could be used
in place of develop or create.

c) <s> <w id="2.1" pos="NN1" sem="M7">Site</w> <w id="2.2" pos="FO" sem="Z3c">XYZ</w>
<w id="2.3" pos="VM" sem="S6+">should</w> <w id="2.4" pos="VVI" sem="A2.1+">develop</w>
<w id="2.5" pos="NN1" sem="P1">test</w> <w id="2.6" pos="NN2" sem="A4.1">cases</w>
<w id="2.7" pos="IF" sem="Z5">for</w> <w id="2.8" pos="NN1" sem="P1">module</w>
<w id="2.9" pos="FO" sem="Z2">Al</w> <w id="2.10" pos="." sem="PUNC">.</w> </s>

d) <s> <w id="3.1" pos="NN1" sem="M7">Site</w> <w id="3.2" pos="FO" sem="Z3c">XYZ</w>
<w id="3.3" pos="TO" sem="Z5">to</w> <w id="3.4" pos="VVI" sem="A2.2">create</w>
<w id="3.5" pos="RP" sem="N2[il.2.2">out</w> <w id="3.5" pos="NN1" sem="P1">test</w>
<w id="3.6" pos="NN2" sem="A4.1">cases</w> <w id="3.7" pos="IF" sem="Z5">for</w>
<w id="3.8" pos="NN1" sem="P1">module</w> <w id="3.9" pos="FO" sem="Z2">Al</w>
<w id="3.10" pos="." sem="PUNC">.</w> </s>
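
As a rough illustration of how annotations such as (c) and (d) might be consumed, the sketch below reads the opening tokens of example (c) with Python's standard xml.etree.ElementTree module and collects the word, part-of-speech tag and semantic tag of each token. The Token tuple and the parse_sentence helper are hypothetical names introduced here, and the markup is retyped with straight quotes on the assumption that the annotator emits well-formed XML.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str  # part-of-speech tag
    sem: str  # semantic category tag


# Opening tokens of example (c), retyped with straight quotes so that it parses.
SENTENCE_C = (
    '<s>'
    '<w id="2.1" pos="NN1" sem="M7">Site</w>'
    '<w id="2.2" pos="FO" sem="Z3c">XYZ</w>'
    '<w id="2.3" pos="VM" sem="S6+">should</w>'
    '<w id="2.4" pos="VVI" sem="A2.1+">develop</w>'
    '<w id="2.5" pos="NN1" sem="P1">test</w>'
    '<w id="2.6" pos="NN2" sem="A4.1">cases</w>'
    '</s>'
)


def parse_sentence(xml_text: str) -> List[Token]:
    """Return one Token per <w> element, in document order."""
    root = ET.fromstring(xml_text)
    return [
        Token(w.text or "", w.get("pos", ""), w.get("sem", ""))
        for w in root.iter("w")
    ]


tokens = parse_sentence(SENTENCE_C)
for token in tokens:
    print(token.word, token.pos, token.sem)
```

The resulting tag sequences are the input that the pattern matching and comparison steps described in this section operate on.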

The sentences have a subject (Site XYZ) and an object (test cases). The subject can 
be syntactically arranged so that it appears at the head (a & b) or the tail of a sentence. 
In examples (a and b), the sentences are of the form: 
subject + infinitive verb (object) 

We studied several minutes documents to discover semantic and part-of-speech
patterns that represent action statements. Below are some examples from the
minutes.

e) Jeff to provide Activity 2 document in template format before Monday 26 June. 

f) Harry and Eileen to provide some text on qualitative work for Activity 4 to Jeff by
Monday morning 26 June.

We hypothesised that statements of actions that will occur in the future identify the
agent or agents (who) and the action (what). We also observed that the sentences contained
a modal verb or the function word ‘to’. Further, agents and action-describing verbs are
connected by modal verbs or the function word ‘to’. We argue that a simple future



6 http://www.comp.lancs.ac.uk/ucrel/usas/semtags.txt 




Language Resources and Tools for Supporting the System Engineering Process 






action sentence consists of three parts: a noun or noun phrase, the function word ‘to’ or a
modal verb, and an action verb or verb phrase. The noun phrase may be a proper noun
such as the name of a person (e.g. Tom) or a geographical name (e.g. London), or a
common noun such as the name of a group of people (e.g. partners, members). The verb
expresses the action (what will happen). The verb phrase may also contain a subordinate
clause, for example a constraint.

Davidson’s [5] treatment of action verbs (verbs that describe what someone did, is
doing or will do) as containing a place for singular terms or variables is consistent
with our observation. For example, the statement “Amundsen flew to the North Pole”
is represented as (∃x) (Flew(Amundsen, North Pole, x)). Davidson called this
representation the logical form of an action sentence. Though Davidson considered actions that
had occurred, the logical form also applies to future actions. For example, the action
sentences (e) and (f) above can be expressed in Davidson’s logical form as (∃x)
(Provide(Jeff, Activity 2 document in template format, x)) and (∃x) (Provide(Harry
and Eileen, some text on qualitative work for Activity 4, x) & To(Jeff, x)) respectively.
The strength of Davidson’s logical form of action sentences is its extensibility, as
illustrated in example (f) above.

From these results, we developed the following protocol for extracting action
sentences.

Noun + function word ‘to’/modal verb + infinitive verb    (p1)

The above protocol can be represented as a template:

semantic="agent" word="to", POS="infinitive verb"

where agent matches the noun phrase and the verb phrase matches both the
infinitive verb and the subordinate clause.

It is important to note that the noun, modal verb and action verb must occur in a 
particular order in any action sentence. The sentences have a leading noun or noun 
phrase followed by a modal verb or a function word ‘to’ and an infinitive verb. If the 
order changes, the meaning is altered. For example, the modal verb ‘can’ conveys the 
sense of ability to do something e.g. “I can drive”. It can also be used to pose a ques- 
tion - “Can you drive?”. Similarly, other modal verbs such as could, shall, should, 
must, may, might, will, would, ought to can be used in the same way. The senses 
that modal verbs convey include ability, advice, expectation, necessity, request, ques- 
tion and possibility. 

In the C7 tag set (see footnote 7), all nouns have a part-of-speech category that starts
with N. For example, NN represents a common noun, NP a proper noun, etc. Using such
information, we are able to identify nouns in sentences. Similarly, verbs have
part-of-speech values that start with V. For example, VB* (see footnote 8) represents the
verb ‘be’ with all its inflected forms, and VD* represents the verb ‘do’ and its inflected
forms. The function word ‘to’ has a part-of-speech value of TO and semantic categories of
either X7 or S6, which indicate planning, choosing, obligation or necessity. It is possible
to map part-of-speech and semantic categories onto the protocol (p1) above as follows:

N* + (TO / VM) + V* = action statement    (p2)

The protocol (p2) retrieved not only all the actions that we identified by intuitive
interpretation of the same text but also identified others. Fig. 3 shows a list of actions
on a contrasting background in a web browser.
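
The mapping in (p2) can be sketched as a check over a sentence's part-of-speech tags. This is a minimal illustration rather than the tool's implementation: the function name find_action_pattern is hypothetical, and it assumes that other tokens may sit between the noun and the linking word (as with the FO-tagged ‘XYZ’ in example (c)), which the text does not state explicitly.

```python
from typing import List, Optional


def find_action_pattern(pos_tags: List[str]) -> Optional[int]:
    """Return the index of the linking TO/VM tag if the tag sequence fits
    N* ... (TO | VM) V*, otherwise None.

    Assumption: tokens may intervene between the noun and the linking word,
    but the verb must directly follow the linking word.
    """
    seen_noun = False
    for i, tag in enumerate(pos_tags[:-1]):
        if tag.startswith("N"):
            seen_noun = True
        elif seen_noun and tag in ("TO", "VM") and pos_tags[i + 1].startswith("V"):
            return i
    return None


# Tags of example (c): Site XYZ should develop test cases for module Al .
tags_c = ["NN1", "FO", "VM", "VVI", "NN1", "NN2", "IF", "NN1", "FO", "."]
print(find_action_pattern(tags_c))

# A question such as "Can you drive?" starts with the modal, so no noun precedes it.
print(find_action_pattern(["VM", "PPY", "VVI", "?"]))
```

Requiring the noun to come first reflects the ordering constraint discussed above: a modal verb in sentence-initial position typically signals a question rather than an action statement.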



7 http://www.comp.lancs.ac.uk/ucrel/claws7tags.html 

8 one or more letters

[Figure: browser view of an annotated minutes document; the action statements are
shown on a contrasting background.]

Fig. 3. Actions in a contrasting background



3.3 Analysing Sentences for Semantic Relationships 

Issues are related in different ways - the relationship between two issues could be in 
containment i.e. an issue is a sub-issue of another. Actions could be related by the fact 
that they were made in the same meeting or, are being implemented by the same per- 
son. Still, they could be related by content i.e. they talk about the same thing. We can 
determine that two sentences talk about the same thing by comparing their content 
words (nouns, verbs, adjectives and adverbs). In examples (a) and (b) in section 3.2 
(Semantic (Deep) Analysis) above, the meaning of the two sentences is determined by 
the sets of words {site, XYZ, should, develop, test, cases, module, Al} and {site,
XYZ, create, test, cases, module, Al}.

The relationship between these two sentences is calculated by comparing the
frequencies of the sense of each word in the first set to the second set. We use the sense
of each word deliberately because nearly every sentence in English can be written in
two or more different ways. The sense is represented in the semantic category. Thus,
the sets above translate to {M7, Z3c, S6+, A2.1, P1, A4.1, P1, Z2} and {M7, Z3c,
VVI, A2.2, A4.1, P1, Z2} respectively. The correspondence between sentence (a) and
sentence (b) is |A ∩ B| / |B| or |A ∩ B| / |A|, whichever is smaller, where A is the
set of semantic values for sentence (a) and B is the set of semantic values for sentence
(b). Thus the similarity between sentence (a) and sentence (b) is 0.75. The smaller
value is selected to take into account the difference in lengths between the two
sentences. For example, if set B has only one member, and that member is also a member
of set A, then we would otherwise claim that the correspondence between (a) and (b) is 1.0.
Such a claim would be inaccurate because of the difference in the length of the
sentences.
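
The measure can be sketched in a few lines. The function name correspondence is hypothetical, and treating the tag lists as multisets (so that a repeated tag counts twice) is an assumption made here; the text speaks of sets and of frequencies without fixing this detail. The example reproduces the one-member case described above.

```python
from collections import Counter
from typing import Iterable


def correspondence(sem_a: Iterable[str], sem_b: Iterable[str]) -> float:
    """Smaller of |A n B| / |A| and |A n B| / |B| over semantic tags.

    Assumption: A and B are treated as multisets of tags.
    """
    a, b = Counter(sem_a), Counter(sem_b)
    if not a or not b:
        return 0.0
    shared = sum((a & b).values())  # size of the multiset intersection
    return min(shared / sum(a.values()), shared / sum(b.values()))


tags_a = ["M7", "Z3c", "S6+", "A2.1", "P1", "A4.1", "P1", "Z2"]
single = ["M7"]  # a one-member B whose only member also occurs in A

# |A n B| / |B| alone would give 1.0; taking the smaller ratio gives 1/8.
print(correspondence(tags_a, single))
```

Taking the smaller ratio is equivalent to dividing the shared count by the size of the longer sentence, which is why a very short sentence cannot score highly against a long one.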

The correspondence values measure the semantic association between two
sentences. As a result, it is possible to track and retrieve the relationships amongst issues
and actions. The relationship is based on the content of a document, not its
metadata. Fig. 4 shows an interface for retrieving related information.



4 Representing Issues, Actions, and Relationships on a Database

We developed a tool that uses a simple relational database to tie together the
unstructured minutes and the structured information identified through the methods
discussed in the sections above. The tokenisation and annotation technique, discussed
in one of the previous sections, gives documents a structure that maps onto a
database structure. The database consists of issues, statements about the issues, actions,
associations and agents. Issues, actions and minutes statements are bookmarked.
Associations between issues and actions are also represented in the database (Fig. 4,
bottom frame).
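
To make the database structure concrete, the sketch below sets up one possible layout for the five kinds of record named above using Python's built-in sqlite3 module. All table and column names are hypothetical and chosen for illustration; the text does not give the actual schema. The sample rows reuse action sentence (e) from section 3.2, and the issue title is invented.

```python
import sqlite3

# Hypothetical schema: one table per kind of record named in the text.
SCHEMA = """
CREATE TABLE agents (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE issues (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE statements (
    id INTEGER PRIMARY KEY,
    issue_id INTEGER NOT NULL REFERENCES issues(id),
    body TEXT NOT NULL
);
CREATE TABLE actions (
    id INTEGER PRIMARY KEY,
    issue_id INTEGER REFERENCES issues(id),
    agent_id INTEGER REFERENCES agents(id),
    body TEXT NOT NULL,
    bookmark TEXT  -- location of the action in the annotated minutes
);
CREATE TABLE associations (
    issue_id INTEGER NOT NULL REFERENCES issues(id),
    action_id INTEGER NOT NULL REFERENCES actions(id),
    correspondence REAL NOT NULL,
    PRIMARY KEY (issue_id, action_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO agents (id, name) VALUES (1, 'Jeff')")
conn.execute("INSERT INTO issues (id, title) VALUES (1, 'Activity 2 document')")
conn.execute(
    "INSERT INTO actions (id, issue_id, agent_id, body) VALUES (1, 1, 1, ?)",
    ("Jeff to provide Activity 2 document in template format",),
)

# List each action together with the agent responsible for it.
rows = conn.execute(
    "SELECT agents.name, actions.body "
    "FROM actions JOIN agents ON agents.id = actions.agent_id"
).fetchall()
print(rows)
```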

The database can also be used to provide services such as communication: the tool
integrates a communication service that periodically interrogates the
database for new actions and posts them to agents. These communications are
recorded and can be tracked.

In meetings, actions can be tracked by receiving and recording reports from agents 
or their representatives. Also, the outcomes of these actions can be recorded. This 
can be particularly important for providing rationales for actions and reasons for 
decisions taken. 



5 Tool Evaluation 



We tested the protocol (p2) against a set of twelve minutes taken from four different
organisations. The organisations (A, B, C and D) are arranged along the first column
(Table 1). The sets are numbered from 1 to 3 for each organisation. Organisation A
minutes were used in developing the document template (Fig. 1). In the table, Recall
is the ratio of the number of relevant actions retrieved to the total number of relevant
actions in the document. Precision is the ratio of the number of relevant actions
retrieved to the total number of relevant and irrelevant actions retrieved. Rel. (relevant)
is the number of relevant actions in a minutes document. Ret. (retrieved) is the
number of actions returned by the protocol (p2) in section 3.2 (Semantic (Deep) Analysis).
RelRet. (Relevant Retrieved) is the number of relevant actions retrieved. Thus if there
are 20 actions in a minutes document and the protocol returns 16 items out of which
10 are relevant, then Precision = 10/16 = 62.5% and Recall = 10/20 = 50%. The average
precisions for the four organisations are 85%, 27%, 56% and 80% respectively, while the
average recalls are 80%, 43%, 95% and 95% respectively.
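
The two ratios follow directly from the Rel., Ret. and RelRet. counts. The short sketch below applies the definitions above to a made-up document; the function name and the counts are illustrative and are not taken from Table 1.

```python
from typing import Tuple


def precision_recall(relevant: int, retrieved: int, relevant_retrieved: int) -> Tuple[float, float]:
    """Precision = RelRet / Ret, Recall = RelRet / Rel (0.0 when the denominator is 0)."""
    precision = relevant_retrieved / retrieved if retrieved else 0.0
    recall = relevant_retrieved / relevant if relevant else 0.0
    return precision, recall


# Hypothetical document: 8 relevant actions, 10 items returned, 6 of them relevant.
p, r = precision_recall(relevant=8, retrieved=10, relevant_retrieved=6)
print(f"precision={p:.2%} recall={r:.2%}")
```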

Generally, the recall rate is good, with an overall average of 78.4%. Precision stands at
about 62.18% overall. The lower precision is attributed to different factors ranging from
personalised styles of writing minutes to grammatical mistakes. In some of the minutes sets
studied, statements which are not actions linguistically were annotated as actions, while
some statements were disjointed and therefore could not properly form a complete
sentence. Since our tool depends on a linguistic annotator to overcome grammatical
mistakes, there is a knock-on effect on our tool.

Table 1. Decision Capture Tool’s information retrieval efficiency results

[Table body not recoverable from the source; surviving cells: 93.33%, 13, 66.67%.]

Average    Recall 78.41%    Precision 62.18%



We are also measuring the correspondence amongst the actions. However, since it 
uses the same principle as identifying actions, we are confident that the result will not 
vary significantly. We predict that the technique could also be applied to spoken dis- 
course but with reduced accuracy. Our preliminary results on the application of the 
technique on spoken discourse indeed confirm this prediction. 



6 Rendering Analysed Text 

In section 2 (Tokenisation and Linguistic Analysis), we argued for the separation of 
content and presentation. In this section, we discuss how to render the processed 
document. Extracted information can be rendered in two ways: as a separate list of
actions or highlighted ‘in-text’. As a separate list, the context is eliminated to enable
people to concentrate on actions. For example, Fig. 4 shows a list of actions extracted
from minutes documents. This could be useful during reviews of previous actions in
subsequent meetings. Other elements of minutes such as agenda and minute state- 
ments can also be viewed separately from the context. 

More importantly, actions can be viewed in text (Fig. 3), thereby preserving the
context. To help with readability, actions are shown on a contrasting background. A
style sheet (see footnote 9) is used for rendering the analysed document in a web browser.
The style sheet is based on the minutes template described in section 2 (Tokenisation and
Linguistic Annotation). The template applies different formatting styles to different parts
of a document and shows action elements on a contrasting background. Using the web
interface, it is possible to browse information in the database and jump directly to
where the element appears in the text.

In Fig. 4, the left pane shows the available functions and categories of
information. The information is organised around categories based on the agenda items
and the dates on which the agenda items were discussed. The interface in Fig. 4 can also
be used for posting minutes to the Tracker server. The Tracker service runs at scheduled
intervals to process minutes documents. This involves issue, action and minute statement
identification, extraction and posting to a database. It also involves the calculation of
relationships amongst actions and issues. The calculated values are stored in the database.



9 http://www.comp.lancs.ac.uk/computing/research/cseg/projects/tracker/test.xsl

Fig. 4. Web interface for browsing decision elements



The tool (Fig. 4) can also be used in different situations (pre-, in- and post-meeting)
to provide services such as eliciting agenda items, adding agents and reviewing actions
to capture their outcomes. In the top right frame, details about the information
category selected from the left pane are displayed. For example, Fig. 4 lists actions from
the minutes of 21/06/2001. The bottom right frame lists the actions related to the
one selected in the top right frame (see footnote 10). The tool also supports a query
facility. A user can start with a query, which then returns the list in Fig. 4. Through the
context link, it is possible to jump to the location where the text occurred.



7 Related Work 

Few meeting capture tools are available off the shelf, and more research is ongoing.
Different dimensions of decision capture are under investigation; some emphasise
capture, others representation. Still others emphasise retrieval. But all three issues
are intertwined to the extent that they cannot be addressed separately. The area of
capture seems to be widely researched, as is evident in tools [3, 4]. However, the areas of
representation and retrieval remain a great challenge [1]. It is our view that more
research into techniques for unobtrusive structuring is needed.



8 Conclusion 

In this paper, we have demonstrated language resources and tools for capturing
actions and issues from minutes to support the system engineering process. The use of
language resources and tools has enabled us to extract actions from minutes documents.
Although this is useful on its own, we do not regard this as a standalone technique but
one to be used in conjunction with audio and video minutes. It forms part of a
broader goal to create tools to facilitate the tracking and re-use of decisions.

10 For the purposes of anonymity, some information is blanked out in Fig. 4.

We have also noticed that precision and recall are inversely related: improving
one impacts negatively on the other. We think that information is more difficult to
find than to identify, thus we have worked to improve recall.

Acknowledgement. The Tracker project is supported under the EPSRC Systems
Integration programme in the UK, project number GR/R12183/01.



Abstract. Although Use Case driven analysis has been widely used in requirements
analysis, it does not facilitate effective requirements elicitation or provide
rationale for the various artifacts that get generated. On the other hand, goal 
and scenario based approach is considered to be effective for elicitation but it 
does not lead to use cases. This paper discusses how to combine goal and sce- 
nario based requirements elicitation technique with use case driven analysis 
using natural language concepts. In our proposed approach, four levels of 
goals, scenario authoring rules, and linguistic techniques have been developed 
to identify use cases from text based goal and scenario descriptions. The arti- 
facts resulting from our approach could be used as input to a use case diagram- 
ming tool to automate the process of use case diagram generation. 



1 Introduction 

Use case driven analysis (UCDA) has been one of the most popular analysis methods 
in Requirements engineering. Use case driven analysis helps to cope with the com- 
plexity of the requirements analysis process. By identifying and then independently 
analyzing different use cases, we may focus on one narrow aspect of the system usage 
at a time. Since the idea of UCDA is simple, and the use case descriptions are based 
on natural concepts that can be found in the problem domain, the customers and the 
end users can actively participate in requirements analysis. Consequently, developers 
can learn more about the potential users, their actual needs, and their typical behavior 
[1]. However, the lack of support for a systematic requirements elicitation process is
probably one of the main drawbacks of UCDA. This lack of elicitation guidance in 
UCDA sometimes results in an ad hoc set of use cases without any underlying ration- 
ale. On the other hand, if one knows the origin of each use case, one could capture 
requirements through UCDA more completely. Our current research addresses the 
issue of the lack of elicitation support in UCDA by using goal and scenario modeling. 
Thus, the objective of this paper is to develop an approach for use case analysis that
makes use of goal and scenario authoring techniques using natural language processing
concepts.

Goal modeling is an effective way to identify requirements [2] [11]. The main em- 
phasis of goal driven approaches is that the rationale for developing a system is to be 
found outside the system itself - in the enterprise in which the system shall function 
[3] [4]. A goal provides a rationale for requirements, i.e., a requirement exists because
of some underlying goal, which provides the basis for it [11] [12] [13]. Recently, 
some proposals have been made to integrate goals and scenarios together. The impe- 
tus for this is the notion that by capturing examples and illustrations, scenarios can 
help people in reasoning about complex systems [5]. 

One of the early proposals that combine a use case with other concepts is that of 
Cockburn [6]. It suggests the use of goals to structure use cases by connecting every 
action in a scenario to a goal. However, Cockburn’s approach is just concerned with
the description of scenarios in a use case. Ralyte [7] integrated scenario based
techniques into existing methods. This led to some enhancement of use case modeling
within the OOSE method. However, Ralyte’s approach does not provide any rationale
for identifying use cases, i.e., it cannot reflect where the use cases come from. Finally, 
these approaches do not support both requirements elicitation and requirements analy- 
sis. Little is known about supporting the elicitation process in UCDA. 

In this paper, we present an approach that supports deriving use cases from goal 
modeling through authoring scenarios. In particular, a linguistics-based technique for
scenario authoring and goal modeling at different levels are proposed to complement
the elicitation process within UCDA. Our aim is to provide a supplementary
elicitation process, which helps a human engineer to analyze the system with UCDA 
through goal and scenario modeling. Figure 1 depicts an overview of our approach. 




Fig. 1. Overview of our approach



Our approach consists of the following two main activities: a) goals and scenarios 
are generated for each goal and scenario level, namely, Business, Service, Interaction, 
and Internal, and b) conversion rules are used to transform the scenarios into a use 
case diagram. When scenarios achieving a goal are authored, they are generated using 
scenario authoring rules outlined in this paper. The goal and scenario authoring activity is
highly iterative and results in a successively refined set of goals and scenarios. The
output of the higher level becomes input to the immediate lower level, thus yielding a
hierarchy of goals and scenarios with various levels. The following three characteristics
exemplify our approach and contribute to the objective of systematically generating
use case diagrams from goal and scenario modeling. Essentially, our approach
bridges the gap between the goal and scenario modeling research and the traditional 
use case driven analysis within requirements engineering. 

The first characteristic of our approach is the abstraction level of goal and scenario 
(AL). Based on [8] [9], we classify <Goal, Scenario> pairs into business, service, 
system interaction, and system internal levels. As a result, the goals and scenarios are 
organized in a four level abstraction hierarchy. This hierarchy-based approach helps
separate concerns in requirements elicitation and also helps in refining a goal.

The second characteristic is the scenario-authoring rule. Scenarios are very familiar 
to users. However, scenario authoring is ambiguous to users and it is not easy to ar- 
ticulate requirements through scenarios because there are no rules to author scenarios. 
Scenario authoring rules help elicit requirements from goals and they could be very 
useful to users and developers alike. 

The third characteristic of our approach is the concept of conversion rules for map- 
ping goal and scenario to actor and use case. These rules guide the user to elicit actors 
and use cases. There are two key notions embodied in the conversion rules. The first 
idea is that each goal at the interaction level is mapped to a use case because the
scenarios that achieve the goal at this level are represented as the interactions
between the system and its agents. This definition is similar to the definition of a use case. The
second idea is that each scenario achieving a goal at this level describes the flow of 
events in a use case. Therefore, this idea signifies that the rationale for use cases is 
founded on goals, which are derived through scenarios. 
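
The two notions can be sketched as a simple transformation. This is only an illustration of the idea as stated in this paragraph, not the conversion rules of Section 4; the class and function names are hypothetical, and the sample goals and scenarios are the ATM examples used in this paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GoalScenarios:
    level: str  # "business", "service", "interaction" or "internal"
    goal: str
    scenarios: List[str] = field(default_factory=list)


@dataclass
class UseCase:
    name: str                  # taken from the interaction-level goal
    flow_of_events: List[str]  # taken from the scenarios achieving that goal


def to_use_cases(entries: List[GoalScenarios]) -> List[UseCase]:
    """Map every interaction-level goal to a use case; other levels are skipped."""
    return [
        UseCase(name=e.goal, flow_of_events=list(e.scenarios))
        for e in entries
        if e.level == "interaction"
    ]


entries = [
    GoalScenarios("service", "Cash withdraw",
                  ["The customer withdraws cash from the ATM"]),
    GoalScenarios("interaction", "Withdraw cash from ATM",
                  ["The ATM receives the amount from user"]),
]
use_cases = to_use_cases(entries)
print(use_cases)
```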

The remainder of the paper is structured as follows. The abstraction levels for goal 
and scenario modeling are described in the next section. Goal and scenario modeling 
based on linguistic techniques is presented in Section 3. Section 4 deals with use case 
conversion rules. The concluding section sums up the essential properties of our ap- 
proach and the contributions of this paper. 



2 Four Abstraction Levels of Goal and Scenario 

Four abstraction levels of goal and scenarios (AL) help separate concerns in require- 
ments elicitation. Prior research has proved the usefulness of multiple levels of ab- 
stractions by applying the approach to the ELEKTRA real world case [10]. In this 
paper, we propose a four level abstraction hierarchy, organized as business, service, 
interaction, and internal level. Goal modeling is accompanied by scenarios corre- 
sponding to each of the abstraction levels. A goal is created at each level and scenar- 
ios are generated to achieve the goal. This is a convenient way to elicit requirements 
through goal modeling because these levels make it possible to refine the goals [8] 
[11]. Four abstraction levels of goal and scenario modeling are discussed below. 



2.1 Business Level 

The aim of the business level is to identify the ultimate purpose of a system. At this 
level, the overall system goal is specified by the organization or a particular user. For 
example, the business goal ‘Improve the services provided to our bank customers’ is 
an overall goal set up by the banking organization.



2.2 Service Level

The aim of the service level is to identify the services that a system should provide to 
an organization and their rationale. At this level, several alternative architectures of 
services are postulated and evaluated. All of them correspond to a given business 
goal. Goals and scenarios at the service level are represented as a pair <G, Sc> where G
is a design goal and Sc is a service scenario. A design goal expresses one possible 
manner of fulfilling the business goal. For example, the design goal ‘Cash withdraw’ 
is one possible way of satisfying the business goal. A service scenario describes the 
flow of services among agents, which are necessary to fulfill the design goal. For 
example, the service scenario ‘The customer withdraws cash from the ATM’ imple- 
ments the design goal ‘Cash withdraw’. 



2.3 Interaction Level 

At the system interaction level the focus is on the interactions between the system and 
its agents. Goals and scenarios at this level are represented as a pair <G, Sc> where G 
is a service goal and Sc is an interaction scenario. These interactions are required to 
achieve the services assigned to the system at the service level. The service goal 
‘Withdraw cash from ATM’ expresses a manner of providing a service. An interaction
scenario describes the flow of interactions between the system and agents. For
example, the interaction scenario ‘The ATM receives the amount from user’ implements
the service goal. 



2.4 Internal Level 

The internal level focuses on what the system needs to perform the interactions se- 
lected at the system interaction level. The ‘what’ is expressed in terms of internal 
system actions that involve system objects but may require external objects such as 
other systems. At this level, goals and scenarios are represented as a pair <G, Sc> 
where G is an interaction goal and Sc is an internal scenario. For example, ‘Deliver 
cash to the user’ is an interaction goal. The associated internal scenario describes the 
flow of interactions among the system objects to fulfill the interaction goal. 
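
The four levels and the <G, Sc> pairs can be sketched as a small data structure in which each pair refines a pair on the level immediately above it. The names Level, Pair and rationale are hypothetical, and enforcing the one-level step in code is an assumption based on the statement that the output of a higher level becomes input to the immediate lower level.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class Level(IntEnum):
    BUSINESS = 1
    SERVICE = 2
    INTERACTION = 3
    INTERNAL = 4


@dataclass
class Pair:
    level: Level
    goal: str
    scenario: Optional[str] = None   # the business level is described by a goal only
    parent: Optional["Pair"] = None  # the pair on the level above that this one refines

    def __post_init__(self) -> None:
        if self.parent is not None and self.level != self.parent.level + 1:
            raise ValueError("a pair must refine a pair on the level immediately above")


def rationale(pair: Pair) -> List[str]:
    """Goals from the business level down to the given pair."""
    chain: List[str] = []
    current: Optional[Pair] = pair
    while current is not None:
        chain.append(current.goal)
        current = current.parent
    return chain[::-1]


business = Pair(Level.BUSINESS, "Improve the services provided to our bank customers")
service = Pair(Level.SERVICE, "Cash withdraw",
               "The customer withdraws cash from the ATM", parent=business)
interaction = Pair(Level.INTERACTION, "Withdraw cash from ATM",
                   "The ATM receives the amount from user", parent=service)
print(rationale(interaction))
```

Walking the parent links upwards recovers where a lower-level goal comes from, which is the kind of rationale that plain use case driven analysis lacks.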



3 Goal and Scenario Modeling Using Linguistics 

This section discusses goal and scenario modeling specific to each abstraction level 
using linguistic concepts. The notions of goal and scenario are briefly described. 
Then, the scenario-authoring rules are discussed. 



3.1 The Concept of Goal and Scenario 

A goal is defined as “something that some stakeholder hopes to achieve in the future”
[4] [8] [11]. A scenario is “a possible behavior limited to a set of purposeful
interactions taking place among several agents” [14]. Scenarios capture real
requirements since they describe real situations or concrete behaviors, and goals can be
achieved through the execution of scenarios. Thus, scenarios have their goals, and 
typically, goals are achieved by scenarios. In other words, just as goals can help in 
scenario discovery, scenarios can also help in goal discovery. As each individual goal 
is discovered, a scenario can be authored for it. Once a scenario has been authored, it 
can be explored to yield further goals [3] [11]. 



3.2 Scenario Authoring Model 

As stated earlier, one can think of scenario-authoring rules as a set of guidelines that 
help generate scenarios. They are based on the linguistic techniques. First, we briefly 
discuss the scenario structure and then the scenario authoring rules. Our scenario 
structure is an extension of Rolland’s approach [8] and due to space limitation we do 
not provide an extensive discussion on scenario structure from [8]. 

3.2.1 The Scenario Structure 

Since a goal is intentional and a scenario is operational by nature, a goal is achieved 
by one or more scenarios. The scenario structure is a template that enables us to
describe scenarios for the goal. Figure 2 shows that our scenario structure is composed
of several components that are analogous to parts of speech, i.e., elements of a sen- 
tence. 




Fig. 2. The scenario structure 



A scenario is associated with a subject, a verb and one or more parameters (multi- 
plicity is shown by a black dot). It is expressed as a clause with a subject, a main verb, 
and several parameters, where each parameter plays a different role with respect to 
the verb. There are two types of parameters (shown in the dash boxes). The main 
components of the scenario structure are described with an ATM example. 

The agent (Agent) is responsible for a given goal and implements an activity as an
active entity. An actor or the system in the sentence can be the agent: for example, in
the scenario ‘(The user)_Agent inserts card into ATM’. The activity (Act) is the main
verb. It shows one step in the sequence of scenarios. The object (Obj) is a conceptual or
physical entity which is changed by activities: for example, ‘The Customer deposits
(cash)_Obj to the ATM’. An object has several properties such as ownership, place,
status, amount, etc., which can be changed by activities. The direction (Dir) is an
entity interacting with the agent. The two types of directions, namely source (So) and
destination (Dest), identify the initial and final location of objects to be communicated
with, respectively. For example, consider the two scenarios given below:

‘The bank customer withdraws cash from (the ATM)_So’,

‘The ATM reports cash transactions to (the Bank)_Dest’

In the first scenario, the source of cash is the ATM, and in the second, the Bank is 
the destination of the cash transactions. 
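
The scenario structure can be sketched as a record with one field per component. The class name Scenario and the render method are hypothetical, and joining the source with ‘from’ and the destination with ‘to’ is an assumption drawn from the two example scenarios rather than a rule stated in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scenario:
    agent: str                         # Agent: the active entity
    act: str                           # Act: the main verb
    obj: str                           # Obj: the entity changed by the activity
    source: Optional[str] = None       # Dir of type So
    destination: Optional[str] = None  # Dir of type Dest

    def render(self) -> str:
        """Assemble the scenario as a single clause."""
        parts = [self.agent, self.act, self.obj]
        if self.source:
            parts.append(f"from {self.source}")
        if self.destination:
            parts.append(f"to {self.destination}")
        return " ".join(parts)


withdraw = Scenario("The bank customer", "withdraws", "cash", source="the ATM")
report = Scenario("The ATM", "reports", "cash transactions", destination="the Bank")
print(withdraw.render())
print(report.render())
```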

3.2.2 Scenario Authoring Rules 

Based on the structure of scenario authoring model, we propose the following sce- 
nario-authoring rules. We have developed a domain analysis approach to identify 
common authoring patterns and their constraints. The formalization of these patterns 
results in the current set of authoring rules. 

In this section, each rule is introduced using the following template: <Definition,
Comment, Example>. The definition explains the contents of the rule. The comment is
expressed as items to be considered when applying the rule. The example component
shows a representative example (we show an example from the ATM domain).

Scenario authoring rule 1 (SI) 

Definition: 

All scenarios should be authored using the following format: 

'Subject: Agent + Verb + Target:Object + Direction:(Source, Destination)’ 

Comment: 

The expected scenario prose is a description of a single course of action. This 
course of action should be an illustration of fulfillment of your goal. You should de- 
scribe the course of actions you expect, not the actions that are not expected, impossi- 
ble, and not relevant with regard to the problem domain. 

Example: 

(The customer)_Agent (deposits)_Verb (cash)_Obj (to the ATM)_Dest

Scenario authoring rule 2 (S2) 

Definition: 

‘Subject’ should be filled with an Agent. 

Comment: 

The agent has the responsibility to fulfill a goal and it may be a human or machine, 
for example, the designed system itself, an object, or a user. 

Example: 

(ATM)_Agent sends a prompt for code to the user

Scenario authoring rule 3 (S3) 

Definition: 

‘Verb’ should include the properties stated at requirements levels.

Comment:

‘Verb’ should express either the service at the service level or the interaction at the
interaction level as a transitive verb.

Example: 

The customer (withdraws)_Verb cash from the ATM (service scenario)

The ATM (displays)_Verb a prompt for amount to the user (interaction scenario)

Scenario authoring rule 4 (S4) 

Definition: 

‘Target’ should be an object.

Comment: 

The object is a conceptual or physical entity. It can be changed by a ‘Verb’. The 
change to an object may happen with one or more of its properties such as ownership, 
place, status, and amount. 

Example: 

The ATM delivers (the cash)_Obj to the user

Scenario authoring rule 5 (S5) 

Definition: 

‘Direction’ should be either source or destination.

Comment:

The two types of directions, namely source and destination, identify the origin and
destination objects for communication. The source is the starting point (object) of the
communication and the destination is the ending point (object) of the communication.
Sometimes, the source takes a preposition such as ‘from’, ‘in’, ‘out of’, etc. The
destination takes a preposition such as ‘to’, ‘toward’, ‘into’, etc.

Example: 

The bank customer withdraws cash (from the ATM)_So

Scenario authoring rule 6 (S6) 

Definition: 

The designed system and the other agents are used exclusively in instantiating the 
Subject and Direction constructs. 

Comment: 

If the system is used to fill the subject slot, the other agents such as human, ma- 
chine, or an external system should be used to fill the direction slot. Thus, the other 
agents interacting with the system should be treated as candidate actors. 

Example: 

(The bank customer)_Agent withdraws cash (from the ATM)_So
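The scenario structure and rules S1 to S6 can be illustrated with a small data structure. The following is a minimal sketch, not part of the proposed approach itself: the class `Scenario`, the helper `_name`, the constant `SYSTEM` and the reading of rule S6 (the designed system may not fill both the subject and the direction slot) are assumptions made for illustration only.

```python
from dataclasses import dataclass

SYSTEM = "ATM"  # hypothetical name of the system being designed


def _name(entity):
    """Normalize 'the ATM' / 'ATM' to 'atm' for comparison."""
    entity = entity.lower()
    return entity[4:] if entity.startswith("the ") else entity


@dataclass(frozen=True)
class Scenario:
    agent: str      # Subject slot, filled with an Agent (S2)
    verb: str       # transitive verb (S3)
    obj: str        # Target slot, filled with an Object (S4)
    direction: str  # entity in the Direction slot
    kind: str       # "source" or "destination" (S5)

    def render(self):
        """Write the scenario in the S1 format, choosing a simple preposition."""
        preposition = "from" if self.kind == "source" else "to"
        return f"{self.agent} {self.verb} {self.obj} {preposition} {self.direction}"

    def violations(self, system=SYSTEM):
        """Return the authoring rules this scenario appears to break."""
        problems = []
        if not all((self.agent, self.verb, self.obj, self.direction)):
            problems.append("S1: every slot must be filled")
        if self.kind not in ("source", "destination"):
            problems.append("S5: direction must be a source or a destination")
        # Assumed reading of S6: system and other agents exclude each other.
        if _name(self.agent) == _name(system) == _name(self.direction):
            problems.append("S6: the system cannot fill both subject and direction")
        return problems


withdraw = Scenario("The bank customer", "withdraws", "cash", "the ATM", "source")
deliver = Scenario("The ATM", "delivers", "the cash", "the user", "destination")
print(withdraw.render())
print(deliver.render(), deliver.violations())
```

A check of this kind could be run while scenarios are being authored, so that slot and direction errors are reported before the scenarios are used for use case conversion.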

Table 1 and Figure 3 show goals and scenarios created for the various requirements
levels and the scenario authoring model for the ATM example. A scenario consists
of actions and states, and the flow of actions shows the system’s behavior. The states
show necessary conditions for the actions to be fired. In general, a state can become
a goal at the lower level (i.e., internal level), and the scenarios corresponding to
that goal describe the state in more detail. For example, the state ‘If the card is
valid’ becomes the internal goal ‘Check the validity of Card’, with corresponding
scenarios.

In Figure 3, goals are represented by solid rectangles and the arrows show the
relationship among goals. The arrow with a dotted line shows the refinement relationship,
i.e., the child goals achieve the same parent goal. The bidirectional arrows
with a solid line connect the sub-goals that “co-achieve the parent goal”. For example,
there is a ‘co-achieve’ relationship between g1.1 and g1.2, as they achieve ‘g1’ together.

Table 1. ATM example of rules S1 to S6

Scenarios 

1. The customer withdraws cash from the ATM

2. The ATM reports cash transactions to the bank 

1. The ATM receives a card from user 
If the card is valid, then (state) 

2. ATM sends a prompt for code to the user 

3. The ATM receives the code from user 
If the code is valid, then (state) 

4. The ATM displays a prompt for amount to the user 

5. The ATM receives the amount from user 
If the amount is valid, then (state) 

6. The ATM ejects the card to the user 
If the user asked the ATM to supply a receipt, then (state) 

7. The ATM ejects the printed receipt to the user 

8. The ATM delivers the cash to the user 







Fig. 3. Partial goal hierarchy with the abstraction levels for ATM example

4 Use Case Conversion

In Section 3.2, a scenario authoring model was proposed to improve the elicitation of 
the requirements through goals. This section describes how the elicited requirements 
are used to model use cases. It also shows a way to overcome the lack of elicitation 
support within UCDA. For use case identification, we propose a relationship model in 
conjunction with use case conversion rules from goals and scenarios. 

The core idea is that the goals and scenarios at the interaction level are used to help 
construct use cases. A goal at the interaction level is achieved by scenarios and a use 
case contains the set of possible scenarios to achieve that goal. This is due to the fact 
that, in our approach, the interaction level focuses on the interactions between the 
system and its agents. The purpose of the use case specification is to describe how the 
agent can interact with the system to get the service and achieve his/her goal. There- 
fore, we propose the following relationship diagram (shown in figure 4) that captures 
the association between the agents, goals, scenarios and use cases. 




Fig. 4. Relationships between goal and other artifacts 



After analyzing each use case, it is augmented with internal activity elements of 
software-to-be. In our approach, goal and scenario at the internal level represent the 
internal behavior of the system under consideration to achieve the goals of the inter- 
action level. Accordingly, use cases can be performed by achieving goals at the inter- 
nal level, and completed by integration of scenarios at that level. 

We also propose several guiding rules for the use case conversion using the same
template as scenario authoring rules. We restrict the example to the service level goal
(e.g., g1 of the ATM example).

Conversion guiding rule 1 (C1)

Definition: 

Goals listed at the interaction level become use cases in accordance with figure 4. 








J. Kim, S. Park, and V. Sugumaran 



Comment: 

As mentioned above, goals at interaction level are mapped to use cases. The use 
cases are named after the goals they correspond to. 

Example: 

For the goal g1, the goals at the interaction level for the ATM example are as follows:
‘Withdraw cash from ATM’,

‘Report the transaction’

These goals become use cases with appropriate descriptions. 



Conversion guiding rule 2 (C2) 

Definition: 

Agents included in scenarios within a goal and wanting to achieve a goal become 
primary actors. 

Comment: 

A goal is achieved by scenarios and several agents may be found in scenarios. As 
discussed in scenario authoring rules, agents are used to instantiate either subject or 
direction objects. Therefore, agents in subject or direction are all treated as actors 
except the system that is currently being designed. 

Example: 

Figure 5 shows the actors that are found in scenarios within goal g1.1 (withdraw
cash from ATM) of the ATM example. Sc1.1 has eight actions. All agents corresponding
to the direction in all the actions are described as ‘user’. Thus, in the case of
Sc1.1, ‘user’ becomes an actor.




g1.1: Withdraw cash from ATM
<Subject, Verb, Target, Direction: user>

Sc1.1:

1. The ATM receives a card from user
If the card is valid, then (state)

2. ATM sends a prompt for code to the user

3. The ATM receives the code from user
If the code is valid, then (state)

4. The ATM displays a prompt for amount to the user

5. The ATM receives the amount from user
If the amount is valid, then (state)

6. The ATM ejects the card to the user
If the user asked the ATM to supply a receipt, then (state)

7. The ATM ejects the printed receipt to the user

8. The ATM delivers the cash to the user

Agent: user (derived by rule S6)





Fig. 5. ATM example of finding actors through scenario structure 



Conversion guiding rule 3 (C3) 

Definition: 

The states contained in scenarios at the interaction level are mapped to internal
goals at the internal level.

Comment:

The states represent necessary conditions for actions to be fired. They can be represented
as internal goals at the internal level. Thus, when specifying a use case, we can
find more details about states in a goal and scenario at the internal level.

Example: 

The state ‘If the card is valid’ of Sc1.1 is described in the sub-goal (g1.1.1),
‘Check the validity of card’, at the internal level. The scenarios corresponding to g1.1.1
will also show more details about the state ‘If the card is valid’.

Conversion guiding rule 4 (C4) 

Definition: 

If goals at the internal level have two or more parent goals, they become another use
case with the <include> relationship.

Comment:

Because the include relationship helps us identify commonality, a goal at the internal
level with two or more parent goals becomes a use case with the <include> relationship.

Example: 

In figure 3, the goal ‘Check the validity of card’ has two parent goals, namely,
‘Withdraw cash from ATM’ and ‘Deposit cash into ATM’. Therefore, the goal
‘Check the validity of card’ can become a use case with the <include> relationship.

Figure 6 shows the use case diagram for the ATM application generated by applying
the conversion rules C1 through C4. It contains two actors, namely, User and Bank,
and six use cases. The use cases associated with the user are Withdraw, Deposit,
Transfer and Inquiry. These use cases include the “Check Validity” use case, which is
associated with the bank along with the Report use case.
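The conversion rules lend themselves to a mechanical reading. The sketch below is only an illustration under stated assumptions: the function names (`use_cases`, `actors`, `included_use_cases`) and the data layout are hypothetical, each action is reduced to a (subject, direction) pair, and rule C4 is applied as in the worked example, where a goal with two parent goals already qualifies.

```python
SYSTEM = "ATM"  # the system being designed is never an actor

# Goals at the interaction level for g1 (taken from the C1 example).
interaction_goals = ["Withdraw cash from ATM", "Report the transaction"]

# (subject, direction) pairs of the eight actions of Sc1.1.
sc1_1 = [("ATM", "user")] * 8

# C3: a state of an interaction-level scenario maps to an internal goal.
state_to_internal_goal = {"If the card is valid": "Check the validity of card"}

# Parent goals of each internal goal (taken from the C4 example).
parents_of = {
    "Check the validity of card": ["Withdraw cash from ATM", "Deposit cash into ATM"],
}


def use_cases(goals):
    """C1: every interaction-level goal becomes a use case named after it."""
    return list(goals)


def actors(actions, system=SYSTEM):
    """C2: agents in the subject or direction slot, except the system."""
    found = set()
    for subject, direction in actions:
        found.update(a for a in (subject, direction) if a != system)
    return found


def included_use_cases(parents):
    """C4: internal goals shared by several parents become <include> use cases."""
    return [goal for goal, ps in parents.items() if len(ps) >= 2]


print(use_cases(interaction_goals))
print(actors(sc1_1))
print(included_use_cases(parents_of))
```

Run on the ATM data quoted above, the sketch yields ‘user’ as the only actor of Sc1.1 and ‘Check the validity of card’ as the only included use case.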




5 Conclusion 

Our proposed approach overcomes the lack of support for the elicitation process in
UCDA and the underlying rationale. It builds on the following two principles: a) both
goal modeling and scenario authoring help to cope with the elicitation process in
UCDA, and b) use case conversion guidelines help us generate use case diagrams
from the output of goal and scenario modeling. There have been some approaches to
create use case diagrams from requirements using some natural language processing
techniques [7]. However, due to the varying degrees of detail in the requirements
text, in many cases, the resulting use case diagrams were found to be practically
unusable. In contrast, our proposed approach produces goal descriptions with the same
level of specificity, and these can be used as adequate input for deriving use case diagrams
(potentially automated using a diagramming tool). This promotes customer communication
in defining system requirements. Our future work includes further refinement
of the scenario structure as well as the authoring and conversion rules. A proof of
concept prototype is currently under development and an empirical validation of the
approach is planned for the near future.

Abstract. The quest for improving retrieval performance has led to
the deployment of larger syntactical units than just plain words. This 
article presents a retrieval experiment that compares the effectiveness 
of two unsupervised language models which generate terms that exceed 
the word boundary. In particular, this article tries to show that index 
expressions provide, beside their navigational properties, a good way to 
capture the semantics of inter-word relations and by doing so, form an 
adequate base for information retrieval applications. 



1 Introduction 

The success of single-word content descriptors in document retrieval systems
is both astonishing and comprehensible. Single-word descriptors are expressive,
have a concise meaning and are easy to find¹. This explains the success of
word-based retrieval systems. Even nowadays, modern internet search engines like
Google use complicated ranking systems and provide boolean query formulation,
yet are in principle still word based.

The employment of larger syntactical units than merely words for Information
Retrieval purposes started in the late sixties [1], but still does not seem to
yield the expected success. There are several non-trivial problems which need to
be solved in order to effectively make use of multi-word descriptors:

— the introduction of multi-word descriptors boosts precision, but hurts recall. 

— the manner of weighting is not obvious, especially in comparison to single- 
word descriptors which react suitably to standard statistically motivated 
weighting schemes (such as term frequency/inverse document frequency). 

— it is not easy to find distant, semantically related, multi-word descriptors. 

The great success of the present statistical techniques combined with such “shallow
linguistic techniques” [2] has led to the idea that deep linguistics is not
worth the effort. However, advancements in natural language processing and
the ability to automatically detect related words [3,4] justify a reevaluation.

This article attempts to compare the effectiveness of several language models 
capable of the unsupervised generation of multi-word descriptors. A comparison 
is made between standard single-word retrieval results, word n-grams and index 
expressions. 

¹ This might be true for the English language, but for some Asian languages (for
example Chinese and Vietnamese) the picture is less clear.

2 Method

2.1 Measuring Retrieval Performance 

To compare the linguistic models we use standard precision figures measured on 
11 different recall values ranging from 0.0 to 1.0, and on the 3 recall values 0.2, 
0.5 and 0.8. Subsequently these values are averaged over all queries. 
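The averaging procedure can be made concrete with a short sketch. The source does not state how precision is obtained at a fixed recall value; the code below assumes the interpolation rule that is customary in retrieval evaluation (the highest precision at any rank whose recall is at least the given level). The function names are illustrative and are not taken from SMART or bright.

```python
ELEVEN_POINT = [i / 10 for i in range(11)]  # recall 0.0, 0.1, ..., 1.0
THREE_POINT = [0.2, 0.5, 0.8]


def interpolated_precision(ranking, relevant, level):
    """Highest precision at any rank whose recall is >= level (assumed rule)."""
    hits, best = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            # Small tolerance guards against floating point noise in level.
            if hits + 1e-9 >= level * len(relevant):
                best = max(best, hits / rank)
    return best


def n_point_average(ranking, relevant, levels):
    """Average the interpolated precision over the given recall levels."""
    values = [interpolated_precision(ranking, relevant, r) for r in levels]
    return sum(values) / len(values)


def mean_over_queries(runs, levels):
    """runs: list of (ranking, relevant set) pairs, one per query."""
    return sum(n_point_average(rk, rel, levels) for rk, rel in runs) / len(runs)


ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(n_point_average(ranking, relevant, ELEVEN_POINT))
print(n_point_average(ranking, relevant, THREE_POINT))
```

In this toy query, precision is 1.0 up to recall 0.5 and 2/3 beyond it, so the two averages differ only in how many recall levels fall on each side of that step.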



SMART and bright. The SMART system, developed by Salton [5], played a significant
role in experimental Information Retrieval research. This vector space
based tool offers the capability to measure and compare the effect of various
weighting schemes and elementary linguistic techniques, such as stopword removal
and stemming.

It became apparent that extending SMART to the specific needs of modern
Information Retrieval research would be rather challenging. The lack of documentation
and the style of coding complicate the extension of the system in
non-trivial ways. These arguments prompted the decision to redesign this valuable
system, preserving its semantic behavior, but written using modern extensible
object-oriented methods. The resulting system, bright, has been used in the
retrieval experiments presented in this article.



Inside bright. In contrast to SMART, the bright system consists of two distinguishable
components: the collection specific parser and the retrieval engine. The
communication between the constituents is realized by an intermediate statistical
collection representation. SMART’s capability to specify the input structure, and
thus parameterize the global parser, has been eliminated. Though this results in
the construction of a parser for each new collection², it provides the flexibility
of testing elaborate linguistic techniques.

The architecture of bright is shown in Figure 1. 



Test collections. The principal test collection used in this article is the Cranfield
test collection [1], a small standard collection of 1398 technical scientific
abstracts³. The collection is accompanied by a rather large set of 225 queries
along with human assessed relevance judgments. It consists of approximately
14,000 lines of text, and contains almost 250,000 words, of which 15,000 are unique.

To show that the approach presented is feasible, we tested our findings on the
Associated Press newswire collection, part of the TREC dataset. This collection
is approximately 800 MB in size, containing 250,000 documents and 50 queries. It
consists of more than 100,000,000 words, of which 300,000 are unique.

² Thanks to the object oriented structure of existing bright parsers, a parser rewrite
is relatively easy.

³ The abstracts are numbered 1 to 1400. Abstracts 471 and 995 are missing.

[Figure 1 diagram; labels: Documents, Queries, Judgements, Collection Specific Parser, Retrieval Engine, Retrieval results]

Fig. 1. The bright architecture



Baseline. The retrieval results of the distinct models will be compared to the
standard multiset model, without the use of a special weighting scheme (simply
cosine normalization). This baseline will be referred to as nnc, equivalent to
SMART’s notation for this particular weighting. The justification for not using
more elaborate weighting methods is twofold:

— statistically motivated weighting schemes may mask the linguistic issues 

— the purpose of the experiment is to compare different models, not to maxi- 
mize (tune) retrieval results 

Although one of the language models outperforms term frequency/inverse document
frequency weighting (atc), this is of less importance regarding the scope
of this article.
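As an illustration of the baseline, the sketch below scores a document against a query using raw descriptor frequencies and cosine normalization only, which is what the nnc label denotes in SMART's notation (natural term frequency, no collection weighting, cosine normalization). The function names `nnc_vector` and `similarity` are hypothetical; this is not the bright implementation.

```python
import math
from collections import Counter


def nnc_vector(descriptors):
    """Raw frequencies divided by the Euclidean length of the vector."""
    counts = Counter(descriptors)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {d: c / norm for d, c in counts.items()} if norm else {}


def similarity(document, query):
    """Inner product of two normalized vectors, i.e. their cosine."""
    doc_vec, query_vec = nnc_vector(document), nnc_vector(query)
    return sum(w * query_vec[d] for d, w in doc_vec.items() if d in query_vec)


document = ["wing", "propeller", "slipstream", "wing"]
query = ["propeller", "slipstream"]
print(round(similarity(document, query), 4))
```

Because no collection statistics enter the score, any difference between runs can be attributed to the descriptors that a language model generates rather than to the weighting.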

2.2 Beyond the Word Boundary 

A key issue in Information Retrieval is to find an efficient and effective mechanism
to automatically derive a document representation that describes its contents.
The most successful approach thus far is to employ statistics of individual words,
ignoring all of the structure within the document (the multiset model). Obviously,
indexing is not necessarily limited to words. The use of larger (syntactical) units
has been the subject of research for many years. The benefit is clear: larger
units allow more detailed (specific) indexes and are a way to raise precision. On
the other hand, the rare occurrences of these units will hurt recall. We describe
two indexing models that exceed the word boundary, namely word n-grams and
index expressions, and compare their retrieval performance using bright.



Word n-grams. The word n-gram model tries to capture inter-word relations
by simply denoting the words as ordered pairs, triples etc. In effect, the n-gram
model extends the multiset model with sequences of (at most n) consecutive
words in the order in which they appear in the text. Consider the following
document excerpt:

An experimental study of a wing in a propeller slipstream was made in 
order to determine the spanwise distribution of the lift increase due to 
slipstream at different angles of attack of the wing and at different free 
stream to slipstream velocity ratios. 

The 2-gram model will add, besides each word individually, the descriptor ‘propeller
slipstream’, which is obviously meaningful. The model is rather imprecise,
however, since adding the descriptor ‘and at’ will probably not contribute to
retrieval performance. Some researchers therefore only add n-grams consisting
of non-stopwords, or consider an n-gram only worthwhile if it has a (fixed)
frequency in the collection.
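The generation of word n-grams can be sketched in a few lines. The tokenization (lower-cased alphabetic runs) and the tiny stopword list are assumptions made for illustration; the function name `word_ngrams` is hypothetical and does not refer to the parser used in the experiments.

```python
import re

# Tiny illustrative stopword list, not the one used in the experiments.
STOPWORDS = {"a", "an", "and", "at", "in", "of", "the", "to", "was"}

EXCERPT = (
    "An experimental study of a wing in a propeller slipstream was made in "
    "order to determine the spanwise distribution of the lift increase due to "
    "slipstream at different angles of attack of the wing and at different free "
    "stream to slipstream velocity ratios."
)


def word_ngrams(text, n, skip_stopwords=False):
    """All sequences of at most n consecutive words, in text order per size."""
    words = re.findall(r"[a-z]+", text.lower())
    grams = []
    for size in range(1, n + 1):
        for start in range(len(words) - size + 1):
            gram = words[start:start + size]
            # Optional refinement: drop multi-word units containing a stopword.
            if skip_stopwords and size > 1 and any(w in STOPWORDS for w in gram):
                continue
            grams.append(" ".join(gram))
    return grams


plain = word_ngrams(EXCERPT, 2)
filtered = word_ngrams(EXCERPT, 2, skip_stopwords=True)
print("propeller slipstream" in plain, "and at" in plain)
print("propeller slipstream" in filtered, "and at" in filtered)
```

With the stopword refinement switched on, ‘propeller slipstream’ survives while ‘and at’ is dropped, which is the effect the refinement is meant to achieve.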



Index expressions. As shown above, simply using sequences of words
for indexing purposes has some drawbacks:

— Sequential words are not necessarily semantically related. 

— Sometimes words are semantically related, but are not sequential. 

It seems plausible to look for combinations of words that are semantically related. 
In [3] an algorithm is presented which is capable of finding relations between 
words in natural language text. These relations form a hierarchical structure 
that is represented by index expressions. 

Index expressions extend term phrases which model the relationships be- 
tween terms. In this light, index expressions can be seen as an approximation 
of the rich concept of noun (and verb) phrases. Their philosophical basis stems 
from Farradane’s relational indexing [6,7]. Farradane projected the idea that a 
considerable amount of the meaning in information objects is denoted in the 
relationships between the terms. 



Language of index expressions. Let $T$ be a set of terms and $C$ a set of
connectors. The language of index expressions is defined over the alphabet
$\Sigma = T \cup C \cup \{(, )\}$ using structural induction:

(i) $t$ is an index expression (for $t \in T$).

(ii) $e_1 \circ c(e_2)$ is an index expression (for index expressions $e_1, e_2$ and $c \in C$).

In this definition, the $\circ$ operator denotes string concatenation. If there is no
risk of confusion, we omit the parentheses when writing down index expressions.

The structural properties of these expressions provide special opportunities
to support a searcher in formulating their information need in terms of an
(information system dependent) query. The resulting mechanism is called Query
by Navigation [8]. In [9] this mechanism is described from a semantical point
of view. By employing the relation between terms and documents, concepts are
derived which are used as navigational pivots during Query by Navigation. Index
expressions have been motivated and validated by their potential to support
interactive query formulation, without assuming that the searcher is familiar
with the collection. The rationale of the approach is that a searcher may not be
able to formulate the information need, but is well capable of recognizing the
relevance of a formulation.

Fig. 2. Tree structure of example sentence.

Consider the following input sentence: 

The report describes an investigation into the design of minimum drag 
tip fins by lifting line theory. 

The corresponding parsed index expression is: 

describe SUB (report IS (the)) OBJ (investigation IS (an) INTO 
(design IS (the) OF (fin IS (tip IS (drag)) IS (minimum)))) BY 
(theory IS (line IS (lifting))) 

whose structure is visualized in figure 2. Note that in this index expression, 
the verb-subject and verb-object relations are represented by the SUB and OBJ 
connectors, while apposition is represented by the IS connector. Using this index 
expression it is possible to generate subexpressions. Simply put, subexpressions 
of an index expression are like subtrees of the tree structure. Preceding a more 
formal definition, we will introduce power index expressions, a notion similar to 
power sets. 

Power index expressions. Let $e = t \circ_{i=1}^{k} c_i(e_i)$ be an index expression. The set
$\Lambda(e)$ of lead expressions belonging to $e$ is defined as follows:

$$\Lambda(e) = \bigcup_{(b_1,\ldots,b_k) \in \{0,1\}^k} t \circ_{i=1}^{k} \bigl(c_i(\Lambda(e_i))\bigr)^{b_i}$$

The power index expression belonging to $e$, denoted by $\mathcal{P}(e)$, is the set

$$\mathcal{P}(e) = \Lambda(e) \cup \bigcup_{i=1}^{k} \mathcal{P}(e_i)$$

Using this definition we can now formally define what a subexpression is:

Subexpression. Let $e_1$ and $e_2$ be two index expressions, then:

$$e_1 \sqsubseteq e_2 \equiv e_1 \in \mathcal{P}(e_2)$$

Among the subexpressions in our example sentence we find ‘describe BY 
theory’, clearly non-sequential words having a strong semantic relation. 

Instead of using all subexpressions as descriptors, we restrict ourselves to
subexpressions having a maximum length⁴. In this article we evaluate the retrieval
performance for 2-index, 3-index and 4-index subexpressions. Note that a
similar linguistic approach which creates head-modifier frames [10] is essentially
a cut-down version of index expressions, while their unnesting into head-modifier
pairs generates index expressions of length 2.
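The definitions of lead expressions, power index expressions and length can be tried out on the example sentence. The sketch below assumes that an index expression is stored as a nested tuple (term, ((connector, subexpression), ...)) and that a lead expression keeps the head term and, for every connected subexpression, either drops it or keeps one of its lead expressions. The helper names `ix`, `lead_expressions`, `power`, `length` and `show` are hypothetical.

```python
from itertools import product


def ix(term, *children):
    """Build an index expression; children are (connector, expression) pairs."""
    return (term, tuple(children))


def lead_expressions(e):
    """All subtrees of e that contain the head term of e."""
    term, children = e
    options = [
        [None] + [(connector, lead) for lead in lead_expressions(child)]
        for connector, child in children
    ]
    return [
        (term, tuple(c for c in choice if c is not None))
        for choice in product(*options)
    ]


def power(e):
    """Lead expressions of e together with the power sets of its children."""
    result = set(lead_expressions(e))
    for _, child in e[1]:
        result |= power(child)
    return result


def length(e):
    """Number of terms occurring in the expression."""
    return 1 + sum(length(child) for _, child in e[1])


def show(e):
    """Write an expression down, omitting parentheses around single terms."""
    term, children = e
    parts = [term]
    for connector, child in children:
        text = show(child)
        parts.append(f"{connector} ({text})" if child[1] else f"{connector} {text}")
    return " ".join(parts)


fin = ix("fin", ("IS", ix("tip", ("IS", ix("drag")))), ("IS", ix("minimum")))
design = ix("design", ("IS", ix("the")), ("OF", fin))
investigation = ix("investigation", ("IS", ix("an")), ("INTO", design))
theory = ix("theory", ("IS", ix("line", ("IS", ix("lifting")))))
sentence = ix("describe", ("SUB", ix("report", ("IS", ix("the")))),
              ("OBJ", investigation), ("BY", theory))

two_index = {show(x) for x in power(sentence) if length(x) <= 2}
print("describe BY theory" in two_index)
```

Restricting the descriptors to a maximum length, as done here for length 2, keeps the number of generated units manageable, since the full power index expression grows quickly with the size of the sentence.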

3 Results 

3.1 Validation Results 

Baseline. The Cranfield baseline experiment yields the following results:

scheme    11-pt average      3-pt average
nnc       0.2363 (100.0%)    0.2201 (100.0%)


Word n-grams. We performed retrieval runs using bright on n-grams with
1 ≤ n ≤ 4 and weighting scheme nnc. n = 1 produces the multiset model
(baseline). Note that, for example, the run with n = 3 uses word sequences of
length 3 and those smaller as semantical units.



n    units     11-pt average      3-pt average
1    7223      0.2363 (100.0%)    0.2201 (100.0%)
2    79675     0.2554 (108.1%)    0.2401 (109.1%)
3    230870    0.2519 (106.6%)    0.2384 (108.3%)
4    422554    0.2422 (102.5%)    0.2273 (103.3%)


The results of the different n-gram runs are depicted in figure 3. It is easy to see
that all n-gram runs perform better than the baseline. The best improvement
is obtained in the high precision - low recall area, which is not surprising, since
n-grams have a more specific meaning, but occur less frequently than words. The
best results are obtained for n = 2. As anticipated, the retrieval performance
decreases slightly when n is increased, because more ‘meaningless’ units are
generated than ‘meaningful’ units.

⁴ The length of an index expression is the number of terms that occur in the expression.

Fig. 3. n-grams compared



Index expressions. As with n-grams, we use bright to measure retrieval 
performance for different maximum lengths of index expressions. Again, for n = 1 
the multiset model is produced which functions as baseline. The results are 
presented below and visualized in figure 4. 



n    terminals    11-pt average      3-pt average
1    7223         0.2363 (100.0%)    0.2201 (100.0%)
2    68061        0.2771 (117.3%)    0.2635 (119.7%)
3    206034       0.2645 (111.9%)    0.2517 (114.4%)
4    429084       0.2515 (106.4%)    0.2353 (106.9%)



The best results are obtained for n = 2. Obviously, long index expressions have
high descriptive power, but are rare. So, similar to n-grams, we notice the highest
improvement in the high precision - low recall area. Interestingly, the 4-index run
starts off relatively well, but as soon as precision drops under 0.4 it is almost
indistinguishable from the baseline.

Fig. 4. Index expressions

Fig. 5. Index expressions vs. n-grams



Comparing n-grams with index expressions. Combining the results of the
previous two sections, we can compare the retrieval performance
of index expressions with the performance of n-grams (see figure 5). 2-index
outperforms 2-grams throughout the recall spectrum. The gain in performance
achieved by 2-grams is doubled by 2-index. This stresses the semantical validity
of automatically generated index expressions.
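The claim that the gain is doubled can be checked against the 11-pt averages reported in the two result tables. The short calculation below uses only those published figures; the helper name `gain` is illustrative.

```python
BASELINE = 0.2363   # nnc, single words
TWO_GRAM = 0.2554   # word n-grams, n = 2
TWO_INDEX = 0.2771  # index expressions, n = 2


def gain(score, reference):
    """Relative improvement of score over reference, in percent."""
    return 100.0 * (score / reference - 1.0)


gram_gain = gain(TWO_GRAM, BASELINE)
index_gain = gain(TWO_INDEX, BASELINE)
print(round(gram_gain, 1), round(index_gain, 1))
print(round(index_gain / gram_gain, 2))
print(round(gain(TWO_INDEX, TWO_GRAM), 1))
```

The first two values reproduce the percentages in the tables, and their ratio shows that the improvement of 2-index over the baseline is slightly more than twice that of 2-grams.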

3.2 TREC Results 

We performed two retrieval runs on the Associated Press collection: a standard 
word based retrieval run (baseline) and the 2-index run. 



type       11-pt average      3-pt average
word       0.0272 (100.0%)    0.0142 (100.0%)
2-index    0.0620 (227.9%)    0.0380 (267.6%)



The relatively low score for the baseline is primarily due to the absence of an 
elaborated weighting scheme. Nevertheless, the 2-index run (with the same sim- 
ple weighting scheme) scores significantly better. 

The resulting precision-recall data is shown in figure 6.



3.3 Weighting Index Expressions 

In the previous experiments we treated index expressions in the same manner 
as terms. Because index expressions often consist of more than one word it 
seems reasonable to give them a higher weight than simple (single word) terms. 



Effectiveness of Index Expressions 







Fig. 6. Associated Press words vs. 2-index 



The following experiment compares the 11-pt average retrieval performance of
index expressions for several weight factors. In figure 7 we show how the retrieval
performance is affected by adjusting the weight factor of index expressions having
length 2. The best retrieval performance is obtained using a weight factor of
approximately 2. The minimal improvement of 5% for weight factor 0 might
seem strange at first glance; one might expect a gain of 0%, since eliminating
index expressions with length 2 leaves us with plain terms. However, there is a
mild form of stemming in the index expression model which contributes to this
small gain in retrieval performance.
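How a weight factor enters the document representation is not spelled out in the text. One plausible reading, sketched below purely as an assumption, multiplies the raw frequency of every descriptor of a given length by its factor before cosine normalization, so that a factor of 0 removes those descriptors altogether. The function name `weighted_vector` and the (descriptor, length) layout are hypothetical.

```python
import math
from collections import Counter


def weighted_vector(descriptors, factors):
    """Cosine-normalized vector built from (descriptor, length) pairs.

    factors maps a descriptor length to its weight factor; lengths
    without an entry (such as single terms) keep the factor 1.
    """
    raw = {}
    for (text, size), freq in Counter(descriptors).items():
        weight = freq * factors.get(size, 1.0)
        if weight > 0:  # a factor of 0 eliminates the descriptor
            raw[text] = weight
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {text: w / norm for text, w in raw.items()} if norm else {}


doc = [("drag", 1), ("fin", 1), ("fin IS tip", 2)]
plain = weighted_vector(doc, {2: 0.0})    # only single terms remain
boosted = weighted_vector(doc, {2: 2.0})  # length-2 descriptors count double
print(sorted(plain))
print(round(boosted["fin IS tip"], 4), round(boosted["fin"], 4))
```

Under this reading a sweep over the factor, as in the experiment, only requires rebuilding the vectors and rerunning the retrieval for each value.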

Studying the retrieval results of index expressions with length smaller than or
equal to 3, there are two changeable parameters: the weight factor of index
expressions of length 2, and the weight factor of index expressions of length 3. This
results in the 3D plot depicted in figure 8. Again, the maximal performance is
obtained by doubling the weight of index expressions with length 2. For index
expressions with length 3 the picture is vague. Apparently the weight factor (and
the importance of these index expressions) is less obvious. According to the data,
the maximal combined performance is for (2.3, 3.1).

Fig. 7. Influence of weight factor

Fig. 8. Influence of weight factors (11-pt average precision on the Cranfield collection)

4 Conclusions 

As shown in this article, index expressions are suitable for capturing the seman- 
tics of inter-word relations. Experiments show that 2-indexes perform better 
than standard word-based retrieval runs, especially on the large TREC collec- 
tion where the retrieval performance is more than doubled. 

Compared to 2-grams, index expressions show an improvement of 10% on
the small Cranfield collection. Due to the enormous number of possible 2-grams
in large collections, it was infeasible to compare 2-grams and 2-indexes for the
TREC collection.

In situations where the structure of index expressions can be exploited (as in
query by navigation) they seem to form a beneficial alternative to term based
systems, which is validated in [11].